Claude Opus 4 launched on May 22, 2025 at 72.5% on SWE-bench Verified โ the highest coding score of any frontier model at the time, and the moment Claude became the default for production AI agents.
That's the short answer. The longer answer is more interesting โ because the Claude 4 release was less a single model drop than the start of a 12-month sprint that took the line from 72.5% to past 80% on the same benchmark while keeping the API surface frozen so enterprises could upgrade in one line of code.
The Anthropic Claude 4 release: what actually shipped
The Anthropic Claude 4 release on May 22, 2025 introduced two models โ Claude Opus 4 and Claude Sonnet 4 โ built around hybrid reasoning, where a single model can answer instantly or switch into extended step-by-step thinking. Opus 4 posted 72.5% on SWE-bench Verified and 43.2% on Terminal-bench, the strongest agentic-coding numbers in the industry, and Anthropic positioned it explicitly as the model for long-running, multi-hour autonomous tasks rather than one-shot chat.
What made the release matter wasn't a single benchmark. It was the combination of three things shipping at once: extended thinking with tool use (the model can search, run code, and reason in the same turn), parallel tool calls, and far better instruction-following discipline โ the unglamorous trait that decides whether an agent finishes a 30-step task or derails at step 7. Anthropic also shipped memory files and the ability to use tools during reasoning, which is what turned Claude from a chatbot into a substrate for building agents.
Claude 4 release timeline and the tiers
The defining feature of the Claude 4 release was the cadence after it. Anthropic shipped incremental upgrades roughly every two to three months, each keeping the Opus / Sonnet / Haiku tier structure intact:
| Model | Released | SWE-bench Verified | Positioning |
|---|---|---|---|
| Claude Opus 4 | May 2025 | 72.5% | Flagship agentic coding |
| Claude Sonnet 4 | May 2025 | 72.7% | Balanced workhorse |
| Claude Sonnet 4.5 | Sep 2025 | 77.2% | Best price/performance |
| Claude Opus 4.5 | Nov 2025 | 80.9% | Top-end reasoning + coding |
| Claude Haiku 4.5 | Oct 2025 | 73.3% | Fast, cheap, high-volume |
| Claude Opus 4.8 | early 2026 | 82%+ | Current flagship |
Benchmark figures are SWE-bench Verified at release; later models report higher scores with extended thinking and tool use enabled.
Three tiers, one purpose each. Opus is the heavyweight for complex agentic work. Sonnet is the workhorse โ most production traffic runs here because it captures roughly 90% of Opus capability at one-fifth the cost. Haiku is the speed-and-volume tier for classification, routing, and latency-sensitive paths. The Claude 4 release standardized this so cleanly that switching tiers is a one-token change in the model string.
Claude 4 benchmark results: where it leads
Benchmarks are noisy, but the pattern across the Claude 4 family is consistent: it wins on agentic, long-horizon, tool-using tasks and stays competitive on raw reasoning. The numbers that matter most to people actually building things:
SWE-bench Verified
72.5% โ 82%+
Real GitHub issue resolution โ the benchmark that predicts coding-agent reliability
Terminal-bench
43.2%+
Autonomous command-line task completion
Tau-bench (agentic)
Top of leaderboard
Multi-turn tool use in retail/airline scenarios
GPQA Diamond
~80%+
Graduate-level science reasoning
Context window
200K tokens
1M-token context available in beta for long-document work
Sustained autonomy
7+ hours
Documented uninterrupted agentic coding runs on Opus 4 line
The number I keep coming back to is the 7+ hour autonomous run. A model that resolves 72.5% of real SWE-bench issues is impressive; a model that can stay on task for a full workday without losing the thread is a different category of product. That endurance โ not the headline accuracy โ is why coding-agent companies standardized on Claude.
Claude 4 vs GPT-5: the head-to-head
OpenAI's GPT-5 is the natural comparison, and the honest take is that they trade the lead by task. Here's how they line up across the dimensions that decide which one a team actually deploys:
| Dimension | Claude 4 (Opus/Sonnet) | GPT-5 |
|---|---|---|
| Agentic coding | Leads โ 72.5%โ82% SWE-bench, 7hr+ runs | Strong but trails on long-horizon |
| Raw reasoning / math | Competitive (~80% GPQA) | Slight edge on hardest math |
| Instruction-following | Best-in-class discipline | Good, more prone to over-helping |
| Tool use reliability | Leads on multi-step chains | Competitive on shorter chains |
| Context window | 200K (1M beta) | Larger nominal context |
| Flagship output price | $75 / 1M tokens (Opus) | Broadly comparable |
| Default for coding agents | Most common 2026 choice | Strong second |
The verdict: if you're building a coding agent or any long-running, multi-step tool-using workflow, Claude 4 is the more reliable default in 2026 โ that's where the SWE-bench lead and the instruction-following discipline compound. If your workload is heavier on isolated hard-reasoning or math problems, GPT-5 is at least as good and sometimes better. For most enterprise application work, the gap is small enough that price, latency, and which provider you already trust matter more than the benchmark delta. I cover the broader provider dynamics in OpenAI vs Anthropic in the enterprise.
Claude 4 pricing and what it costs to run
Pricing is where the tier structure pays off. Opus 4 runs $15 per million input tokens and $75 per million output tokens. Sonnet 4.5 is $3 input and $15 output โ five times cheaper โ and is where most production traffic belongs. Haiku 4.5 lands around $1 input and $5 output for high-volume work. Two discounts change the math materially: prompt caching cuts repeated input costs by up to 90%, and batch processing takes a further 50% off non-urgent jobs.
The practical pattern most teams converge on: route the bulk of requests to Sonnet, escalate only the genuinely hard agentic tasks to Opus, and push classification and routing to Haiku. Done well, that keeps a heavy coding-agent product's blended cost far below what running everything on the flagship would imply. Anthropic's reported run-rate โ climbing past $9B annualized through 2025 into 2026 โ is built on exactly this enterprise-API consumption, which feeds directly into the AI valuations debate.
What the Claude 4 release means for founders and investors
As a 3x founder and someone who's made 65+ investments, the Claude 4 release reframed two things I watch closely. First, the moat moved from "which model is smartest" to "which model is most reliable across a 30-step task" โ and reliability is far stickier than a benchmark lead. Second, the frozen API surface across releases is an underrated competitive weapon: every team that upgraded Opus 4 โ 4.5 โ 4.8 by changing one string is a team that didn't evaluate switching to a competitor.
Where Claude 4 wins
- โ Coding agents and long-horizon autonomy
- โ Multi-step tool chains that must not derail
- โ Strict instruction-following workloads
- โ Teams that value upgrade stability
Where to consider GPT-5
- โข Hardest isolated math/reasoning problems
- โข Workloads needing the largest nominal context
- โข Teams already deep in the OpenAI stack
- โข Consumer-facing chat over agentic backends
The Claude 4 release wasn't about winning a benchmark for a week.
It was about building the most reliable place to run an agent โ and freezing the API so nobody had a reason to leave.
Track frontier AI model valuations and adoption on the AI Valuations Dashboard at Value Add VC. Originally published in the Trace Cohen newsletter.