When was Anthropic Claude 4 released?

Anthropic released Claude 4 — Opus 4 and Sonnet 4 — on May 22, 2025. The line was then iterated rapidly: Sonnet 4.5 arrived in September 2025, Opus 4.5 in November 2025, and Opus 4.8 in early 2026. Each release kept the same API surface so teams could upgrade by swapping a single model string.

How good is Claude 4 on benchmarks?

Claude Opus 4 scored 72.5% on SWE-bench Verified at launch in May 2025, the highest coding score of any frontier model at the time. The later 4.5 and 4.8 releases pushed past 80% on the same benchmark. On agentic tool-use and long-horizon tasks, the Claude 4 family has consistently led or tied the top of the leaderboard.

How does Claude 4 compare to GPT-5?

Claude 4 (Opus/Sonnet) and GPT-5 trade the lead depending on the task. Claude tends to win on agentic coding, long multi-step tool use, and instruction-following discipline, while GPT-5 is competitive on raw reasoning and math benchmarks. For production coding agents, most teams in 2026 report Claude as the more reliable default, though the gap is small and shifts with each release.

How much does Claude 4 cost to use?

Claude Opus 4 is priced at $15 per million input tokens and $75 per million output tokens. Sonnet 4.5 costs $3 per million input and $15 per million output — five times cheaper than Opus. Prompt caching can cut input costs by up to 90%, and batch processing offers a further 50% discount, which is how most production teams keep agent costs manageable.

What is the difference between Claude Opus, Sonnet, and Haiku?

Opus is the most capable and most expensive tier, built for complex agentic work and hard reasoning. Sonnet balances cost and capability and is the workhorse for most production applications at $3/$15 per million tokens. Haiku is the fastest and cheapest tier, used for high-volume, latency-sensitive tasks like classification and routing.

Claude 4: 80% on SWE-bench, May 2025 Release

Claude Opus 4 launched on May 22, 2025 at 72.5% on SWE-bench Verified — the highest coding score of any frontier model at the time, and the moment Claude became the default for production AI agents.

That's the short answer. The longer answer is more interesting — because the Claude 4 release was less a single model drop than the start of a 12-month sprint that took the line from 72.5% to past 80% on the same benchmark while keeping the API surface frozen so enterprises could upgrade in one line of code.

The Anthropic Claude 4 release: what actually shipped

The Anthropic Claude 4 release on May 22, 2025 introduced two models — Claude Opus 4 and Claude Sonnet 4 — built around hybrid reasoning, where a single model can answer instantly or switch into extended step-by-step thinking. Opus 4 posted 72.5% on SWE-bench Verified and 43.2% on Terminal-bench, the strongest agentic-coding numbers in the industry, and Anthropic positioned it explicitly as the model for long-running, multi-hour autonomous tasks rather than one-shot chat.

What made the release matter wasn't a single benchmark. It was the combination of three things shipping at once: extended thinking with tool use (the model can search, run code, and reason in the same turn), parallel tool calls, and far better instruction-following discipline — the unglamorous trait that decides whether an agent finishes a 30-step task or derails at step 7. Anthropic also shipped memory files and the ability to use tools during reasoning, which is what turned Claude from a chatbot into a substrate for building agents.

Claude 4 release timeline and the tiers

The defining feature of the Claude 4 release was the cadence after it. Anthropic shipped incremental upgrades roughly every two to three months, each keeping the Opus / Sonnet / Haiku tier structure intact:

Model	Released	SWE-bench Verified	Positioning
Claude Opus 4	May 2025	72.5%	Flagship agentic coding
Claude Sonnet 4	May 2025	72.7%	Balanced workhorse
Claude Sonnet 4.5	Sep 2025	77.2%	Best price/performance
Claude Opus 4.5	Nov 2025	80.9%	Top-end reasoning + coding
Claude Haiku 4.5	Oct 2025	73.3%	Fast, cheap, high-volume
Claude Opus 4.8	early 2026	82%+	Current flagship

Benchmark figures are SWE-bench Verified at release; later models report higher scores with extended thinking and tool use enabled.

Three tiers, one purpose each. Opus is the heavyweight for complex agentic work. Sonnet is the workhorse — most production traffic runs here because it captures roughly 90% of Opus capability at one-fifth the cost. Haiku is the speed-and-volume tier for classification, routing, and latency-sensitive paths. The Claude 4 release standardized this so cleanly that switching tiers is a one-token change in the model string.

Claude 4 benchmark results: where it leads

Benchmarks are noisy, but the pattern across the Claude 4 family is consistent: it wins on agentic, long-horizon, tool-using tasks and stays competitive on raw reasoning. The numbers that matter most to people actually building things:

SWE-bench Verified

72.5% → 82%+

Real GitHub issue resolution — the benchmark that predicts coding-agent reliability

Terminal-bench

43.2%+

Autonomous command-line task completion

Tau-bench (agentic)

Top of leaderboard

Multi-turn tool use in retail/airline scenarios

GPQA Diamond

~80%+

Graduate-level science reasoning

Context window

200K tokens

1M-token context available in beta for long-document work

Sustained autonomy

7+ hours

Documented uninterrupted agentic coding runs on Opus 4 line

The number I keep coming back to is the 7+ hour autonomous run. A model that resolves 72.5% of real SWE-bench issues is impressive; a model that can stay on task for a full workday without losing the thread is a different category of product. That endurance — not the headline accuracy — is why coding-agent companies standardized on Claude.

Claude 4 vs GPT-5: the head-to-head

OpenAI's GPT-5 is the natural comparison, and the honest take is that they trade the lead by task. Here's how they line up across the dimensions that decide which one a team actually deploys:

Dimension	Claude 4 (Opus/Sonnet)	GPT-5
Agentic coding	Leads — 72.5%→82% SWE-bench, 7hr+ runs	Strong but trails on long-horizon
Raw reasoning / math	Competitive (~80% GPQA)	Slight edge on hardest math
Instruction-following	Best-in-class discipline	Good, more prone to over-helping
Tool use reliability	Leads on multi-step chains	Competitive on shorter chains
Context window	200K (1M beta)	Larger nominal context
Flagship output price	$75 / 1M tokens (Opus)	Broadly comparable
Default for coding agents	Most common 2026 choice	Strong second

The verdict: if you're building a coding agent or any long-running, multi-step tool-using workflow, Claude 4 is the more reliable default in 2026 — that's where the SWE-bench lead and the instruction-following discipline compound. If your workload is heavier on isolated hard-reasoning or math problems, GPT-5 is at least as good and sometimes better. For most enterprise application work, the gap is small enough that price, latency, and which provider you already trust matter more than the benchmark delta. I cover the broader provider dynamics in OpenAI vs Anthropic in the enterprise.

Claude 4 pricing and what it costs to run

Pricing is where the tier structure pays off. Opus 4 runs $15 per million input tokens and $75 per million output tokens. Sonnet 4.5 is $3 input and $15 output — five times cheaper — and is where most production traffic belongs. Haiku 4.5 lands around $1 input and $5 output for high-volume work. Two discounts change the math materially: prompt caching cuts repeated input costs by up to 90%, and batch processing takes a further 50% off non-urgent jobs.

The practical pattern most teams converge on: route the bulk of requests to Sonnet, escalate only the genuinely hard agentic tasks to Opus, and push classification and routing to Haiku. Done well, that keeps a heavy coding-agent product's blended cost far below what running everything on the flagship would imply. Anthropic's reported run-rate — climbing past $9B annualized through 2025 into 2026 — is built on exactly this enterprise-API consumption, which feeds directly into the AI valuations debate.

What the Claude 4 release means for founders and investors

As a 3x founder and someone who's made 65+ investments, the Claude 4 release reframed two things I watch closely. First, the moat moved from "which model is smartest" to "which model is most reliable across a 30-step task" — and reliability is far stickier than a benchmark lead. Second, the frozen API surface across releases is an underrated competitive weapon: every team that upgraded Opus 4 → 4.5 → 4.8 by changing one string is a team that didn't evaluate switching to a competitor.

Where Claude 4 wins

✓ Coding agents and long-horizon autonomy
✓ Multi-step tool chains that must not derail
✓ Strict instruction-following workloads
✓ Teams that value upgrade stability

Where to consider GPT-5

• Hardest isolated math/reasoning problems
• Workloads needing the largest nominal context
• Teams already deep in the OpenAI stack
• Consumer-facing chat over agentic backends

The Claude 4 release wasn't about winning a benchmark for a week.

It was about building the most reliable place to run an agent — and freezing the API so nobody had a reason to leave.

Track frontier AI model valuations and adoption on the AI Valuations Dashboard at Value Add VC. Originally published in the Trace Cohen newsletter.

Get VC data most people never see

— 100% free

Weekly benchmarks, valuations, and fund data. Join 5,000+ investors. No spam.

The Anthropic Claude 4 release: what actually shipped

Claude 4 release timeline and the tiers

Model	Released	SWE-bench Verified	Positioning
Claude Opus 4	May 2025	72.5%	Flagship agentic coding
Claude Sonnet 4	May 2025	72.7%	Balanced workhorse
Claude Sonnet 4.5	Sep 2025	77.2%	Best price/performance
Claude Opus 4.5	Nov 2025	80.9%	Top-end reasoning + coding
Claude Haiku 4.5	Oct 2025	73.3%	Fast, cheap, high-volume
Claude Opus 4.8	early 2026	82%+	Current flagship

Benchmark figures are SWE-bench Verified at release; later models report higher scores with extended thinking and tool use enabled.

Claude 4 benchmark results: where it leads

SWE-bench Verified

72.5% → 82%+

Real GitHub issue resolution — the benchmark that predicts coding-agent reliability

Terminal-bench

43.2%+

Autonomous command-line task completion

Tau-bench (agentic)

Top of leaderboard

Multi-turn tool use in retail/airline scenarios

GPQA Diamond

~80%+

Graduate-level science reasoning

Context window

200K tokens

1M-token context available in beta for long-document work

Sustained autonomy

7+ hours

Documented uninterrupted agentic coding runs on Opus 4 line

Claude 4 vs GPT-5: the head-to-head

OpenAI's GPT-5 is the natural comparison, and the honest take is that they trade the lead by task. Here's how they line up across the dimensions that decide which one a team actually deploys:

Dimension	Claude 4 (Opus/Sonnet)	GPT-5
Agentic coding	Leads — 72.5%→82% SWE-bench, 7hr+ runs	Strong but trails on long-horizon
Raw reasoning / math	Competitive (~80% GPQA)	Slight edge on hardest math
Instruction-following	Best-in-class discipline	Good, more prone to over-helping
Tool use reliability	Leads on multi-step chains	Competitive on shorter chains
Context window	200K (1M beta)	Larger nominal context
Flagship output price	$75 / 1M tokens (Opus)	Broadly comparable
Default for coding agents	Most common 2026 choice	Strong second

Claude 4 pricing and what it costs to run

What the Claude 4 release means for founders and investors

Where Claude 4 wins

✓ Coding agents and long-horizon autonomy
✓ Multi-step tool chains that must not derail
✓ Strict instruction-following workloads
✓ Teams that value upgrade stability

Where to consider GPT-5

• Hardest isolated math/reasoning problems
• Workloads needing the largest nominal context
• Teams already deep in the OpenAI stack
• Consumer-facing chat over agentic backends

The Claude 4 release wasn't about winning a benchmark for a week.

It was about building the most reliable place to run an agent — and freezing the API so nobody had a reason to leave.

Track frontier AI model valuations and adoption on the AI Valuations Dashboard at Value Add VC. Originally published in the Trace Cohen newsletter.

Get VC data most people never see

— 100% free

Weekly benchmarks, valuations, and fund data. Join 5,000+ investors. No spam.

Anthropic Claude 4: What's New, Benchmark Results, and How It Compares to GPT-5

The Anthropic Claude 4 release: what actually shipped

Claude 4 release timeline and the tiers

Claude 4 benchmark results: where it leads

Claude 4 vs GPT-5: the head-to-head

Claude 4 pricing and what it costs to run

What the Claude 4 release means for founders and investors

Frequently Asked Questions

Related Tools & Dashboards

Keep Reading

Anthropic Claude 4: What's New, Benchmark Results, and How It Compares to GPT-5

The Anthropic Claude 4 release: what actually shipped

Claude 4 release timeline and the tiers

Claude 4 benchmark results: where it leads

Claude 4 vs GPT-5: the head-to-head

Claude 4 pricing and what it costs to run

What the Claude 4 release means for founders and investors

Frequently Asked Questions

Related Tools & Dashboards

Keep Reading