What are the Anthropic Claude 4 benchmark scores in 2026?

Across the Claude 4 family, the headline result is SWE-bench Verified at roughly 72% to 82%, depending on the model and whether parallel test-time compute is used — the strongest agentic-coding score among frontier models. Claude 4 also posts around 80% on GPQA Diamond (graduate-level science), 88% or higher on multilingual MMLU, and competitive numbers on AIME math and Terminal-Bench. These are Anthropic-published figures and should be treated as a ceiling rather than a typical real-world result.

Is Claude 4 better than GPT-5 on benchmarks?

On agentic coding (SWE-bench Verified and tool-use tasks) Claude 4 generally leads or ties GPT-5, which helps explain its strong position in AI coding tools. On reasoning and math evals like GPQA Diamond and AIME, GPT-5 and Claude 4 trade the top spot within a few points, and Gemini 2.5 Pro is close behind. No single model wins every benchmark; the gaps between the top three frontier models are now under 5 percentage points on most evals.

What is SWE-bench Verified and why does it matter for Claude 4?

SWE-bench Verified is a benchmark of 500 real GitHub issues, each of which a model must fix by editing the project's repository so that the project's tests pass. It matters because it measures agentic software engineering (multi-step reasoning, tool use, and code that actually passes tests) rather than trivia recall. Of the scores in this article, Claude 4's 72–82% here is the most useful predictor of how the model will perform inside coding agents like Claude Code, Cursor, and similar tools.
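A benchmark of 500 tasks also carries sampling noise, which is one reason a gap of a point or two between models means little. The sketch below estimates that noise with the normal approximation to the binomial. It assumes tasks are independent, which is a simplification, and the 77% input is only an illustrative midpoint of the 72–82% range, not a reported score.

```python
import math


def score_interval(score: float, n_tasks: int = 500, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% interval for a pass rate measured on n_tasks tasks.

    Normal approximation to the binomial; assumes independent tasks.
    """
    standard_error = math.sqrt(score * (1.0 - score) / n_tasks)
    return (score - z * standard_error, score + z * standard_error)


# 0.77 is an illustrative midpoint, not a published result.
low, high = score_interval(0.77)
print(f"77% on 500 tasks: roughly {low:.1%} to {high:.1%}")
```

Under these assumptions the interval is about plus or minus 3.7 points, so two models a couple of points apart on a 500-task eval are hard to separate on that score alone.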

How much does Claude 4 cost per token in 2026?

Claude 4 pricing spans roughly $1 to $75 per million tokens depending on the model. Claude Haiku runs about $1 in and $5 out per million tokens, Claude Sonnet about $3 in and $15 out, and Claude Opus about $15 in and $75 out. Prompt caching and batch processing can cut effective costs by 50% to 90%, which often matters more for production economics than the raw benchmark score.
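To see how those discounts change the bill, here is a minimal cost sketch using the approximate list rates quoted above. Several things in it are assumptions rather than Anthropic's documented billing rules: the function and parameter names are hypothetical, the cache discount is applied only to the cached share of input tokens, any surcharge for writing to the cache is ignored, and the batch discount is assumed to stack with caching.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    input_per_m: float   # USD per million input tokens
    output_per_m: float  # USD per million output tokens


# Approximate list rates as quoted in this article.
PRICES = {
    "haiku": Price(1.0, 5.0),
    "sonnet": Price(3.0, 15.0),
    "opus": Price(15.0, 75.0),
}


def request_cost(
    model: str,
    input_tokens: int,
    output_tokens: int,
    cached_fraction: float = 0.0,  # share of input tokens served from cache
    cache_discount: float = 0.90,  # "up to 90%" off cached input (assumed)
    batch: bool = False,           # 50% off the whole request (assumed to stack)
) -> float:
    """Estimated USD cost of one request under the assumptions above."""
    price = PRICES[model]
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    billable_input = fresh + cached * (1.0 - cache_discount)
    total = (
        billable_input * price.input_per_m / 1_000_000
        + output_tokens * price.output_per_m / 1_000_000
    )
    return total * 0.5 if batch else total


plain = request_cost("sonnet", 20_000, 1_000)
cached = request_cost("sonnet", 20_000, 1_000, cached_fraction=0.9)
print(f"Sonnet, no caching:  ${plain:.4f}")
print(f"Sonnet, 90% cached:  ${cached:.4f}")
```

For a hypothetical request of 20,000 input and 1,000 output tokens, this works out to about $0.075 without caching and about $0.026 with 90% of the input cached, roughly a 65% saving.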

Should buyers trust vendor-published Claude 4 benchmarks?

Treat them as directional, not definitive. Vendor benchmarks are run under favorable conditions — best prompts, extended thinking, sometimes parallel sampling — and the headline number is often the high end of a range. The reliable signal is relative: Claude 4's consistent lead on agentic-coding evals across multiple independent trackers is more trustworthy than any single decimal. The only benchmark that truly counts is an evaluation on your own task with your own data.

Claude 4 Benchmarks: 72–82% SWE-bench, 80% GPQA

Claude 4 posts the best agentic-coding result in the industry, 72% to 82% on SWE-bench Verified, while clearing roughly 80% on GPQA Diamond and 88%+ on multilingual MMLU. That's the short answer. The longer answer is more interesting.

Benchmark tables are how AI labs sell models, and they're half measurement, half marketing. The numbers are real, but they're run under ideal conditions — best prompt, extended thinking, sometimes parallel sampling — and the figure that ends up on the launch slide is usually the top of a range. If you're a buyer deciding where to route a workload or which model to wrap in a product, the scores matter less than what they predict about your task. Here's the full Claude 4 benchmark picture, by model and by price, and how to read it without getting sold.

Anthropic Claude 4 Benchmarks: The Numbers That Actually Matter

The Anthropic Claude 4 benchmarks that matter most cluster around three families: agentic coding (SWE-bench Verified, Terminal-Bench), reasoning and science (GPQA Diamond, AIME, MMMU), and knowledge and language (MMLU, multilingual MMLU). Claude 4's defining result is SWE-bench Verified at 72–82%, the strongest agentic-coding score of any frontier model. On the reasoning and knowledge tests it runs within a few points of GPT-5 and Gemini 2.5 Pro rather than dominating them.

That distribution is the whole story. Claude 4 is not uniformly the "smartest" model on every leaderboard — it's the model that turns intelligence into working code and completed multi-step tasks more reliably than anything else. For most buyers, that's the benchmark that pays. Here is how the family scores across the headline evals.

Benchmark	What it measures	Claude 4 (Opus / Sonnet)	Read on it
SWE-bench Verified	Fixing 500 real GitHub issues	~72–82%	Best-in-class; the headline
Terminal-Bench	Agentic command-line tasks	~43–50%	Leading, but still hard for all models
GPQA Diamond	Graduate-level science Q&A	~79–83%	Frontier-level, ties GPT-5
AIME (math)	Competition math problems	~75–90%	Strong with extended thinking on
MMMU	Multimodal college-level reasoning	~76–80%	Competitive, not category-leading
MMLU (multilingual)	Knowledge across 14+ languages	~88–90%	Near the top of the field
Tau-bench (tool use)	Agentic tool / function calling	~70–82%	Best-in-class for real agents

Figures are 2025–2026 ranges blended from Anthropic's published model cards, Artificial Analysis, and independent SWE-bench leaderboards. Ranges span the Claude 4 family (Opus and Sonnet variants) and reflect differences in extended thinking and parallel test-time compute; the high end usually requires settings most production deployments do not run.

How the Claude 4 Benchmarks Compare to GPT-5 and Gemini

The honest summary of the Claude 4 benchmarks versus the rest of the frontier: the gaps are small and the leadership is category-specific. Claude 4 wins agentic coding and tool use. GPT-5 and Claude 4 trade the lead on raw reasoning and math. Gemini 2.5 Pro wins on context length and multimodal breadth and stays within a few points everywhere else. No model sweeps the board, and the spread between the top three on most evals is now under 5 percentage points.

Capability	Claude 4	GPT-5	Gemini 2.5 Pro
Agentic coding (SWE-bench)	Leader (~72–82%)	~70–75%	~64–70%
Tool use / agents	Leader	Strong	Strong
Math (AIME)	Strong	Leader	Strong
Science (GPQA)	~79–83%	~80–85%	~82–84%
Context window	200K (1M beta)	~256K–400K	1M+ (leader)
Multimodal breadth	Good	Strong	Leader
Top-tier price (per M out)	$75 (Opus)	~$10–60	~$10–15

Comparison blended from vendor model cards, Artificial Analysis, and the LMSYS/Chatbot Arena leaderboard as of mid-2026. Scores shift with each point release; treat category leadership as more durable than any single percentage. Pricing reflects list rates before caching or batch discounts.

The practical takeaway: if your workload is code, agents, or anything that chains tool calls, the Claude 4 benchmarks justify the choice. If it's long-document analysis at a million tokens, Gemini's context window may matter more than a few SWE-bench points. The valuations behind these labs reflect exactly that race — you can see where Anthropic and its peers sit on the AI Valuations dashboard.

Claude 4 Benchmark Scores by Model and Price

The Claude 4 benchmark numbers only mean something next to their price. Anthropic ships a tiered family (Opus, Sonnet, Haiku) where each step up costs a multiple of the tier below. The step from Haiku to Sonnet buys a clear jump in capability; the step from Sonnet to Opus buys a few points at most. The SWE-bench gap between Sonnet and Opus is far smaller than the price gap, which is why most production traffic runs on Sonnet, not the flagship.

Model tier	SWE-bench Verified	Input $/M	Output $/M	Best for
Claude Opus 4	~72–80%	~$15	~$75	Hardest reasoning & agent runs
Claude Sonnet 4 / 4.5	~73–82%	~$3	~$15	Production coding & agents
Claude Haiku 4.5	~55–65%	~$1	~$5	High-volume, latency-sensitive
+ Prompt caching	—	up to −90%	—	Repeated-context workloads
+ Batch API	—	−50%	−50%	Async, non-urgent jobs
Extended thinking	+3–8 pts	more tokens	more tokens	Buys accuracy, costs latency

Pricing and scores are 2026 list-rate estimates from Anthropic's pricing page and published model cards; exact figures vary by point release and region. Caching and batch discounts are Anthropic-documented program rates. SWE-bench ranges reflect with/without extended thinking and parallel sampling.

Notice the pattern that doesn't show up in a launch headline: a roughly 5x price step from Sonnet to Opus buys low-single-digit benchmark points at best, and in the SWE-bench ranges above Sonnet sits level with or slightly ahead of Opus. For most teams, Sonnet plus prompt caching is the rational default, with Opus reserved for the genuinely hard runs. That cost curve, with capability getting cheaper per token every quarter, is the same dynamic reshaping software margins across the sector, which I've dug into on the SaaS Valuations dashboard.

What the Claude 4 Benchmarks Don't Tell You

Here's the part the table can't capture. A benchmark is a fixed test with a known answer key; your workload isn't. Three things consistently break the link between a high Claude 4 benchmark score and real-world performance, and buyers who ignore them get burned.

One: contamination and overfitting. Frontier models are trained on enormous slices of the public internet, and popular benchmarks leak into training data. A 90% on a well-known eval partly measures how much of that eval the model has effectively memorized. SWE-bench Verified is harder to game because it requires working code against hidden tests, which is exactly why it's the most trusted number in the table.

Two: the settings asterisk. The high end of every range — the 82% rather than the 72% — usually requires extended thinking, longer time budgets, or parallel sampling that you may not run in production because it multiplies latency and cost. The launch number and the number you actually get can differ by 5–10 points purely on configuration.

Three: the things no benchmark scores. Latency, refusal behavior, instruction-following on messy real prompts, output formatting reliability, and how the model degrades on your weird edge cases — none of that shows up on a leaderboard, and all of it determines whether a deployment works. This is why Claude's reputation in agentic coding tools outruns its raw scores: it tends to follow instructions and stay on task in ways the percentages don't express.

How Buyers Should Actually Use the Claude 4 Benchmarks

My rule after watching a dozen teams pick models: use vendor benchmarks to build a shortlist, never to make the final call. The Claude 4 benchmarks reliably tell you it belongs on the shortlist for any coding, agent, or tool-use workload — that's the durable signal across multiple independent trackers. They do not tell you it's right for your specific task.

So do the cheap thing that 90% of buyers skip: assemble 50–100 real examples from your own workflow, run them through Sonnet and Opus and one competitor, and score the outputs on the dimensions you care about — correctness, latency, format reliability, cost per task. That eval costs a few hundred dollars in tokens and a day of work, and it's worth more than every public leaderboard combined. The frontier models are close enough now that the right answer is almost always workload-specific.
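A minimal sketch of that kind of in-house eval follows. Everything named here is hypothetical: a "model" is just any callable that returns the output text and the dollar cost of the call, the stub stands in for a real API client, and exact-match scoring is a placeholder for whatever correctness check fits your task.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Example:
    prompt: str
    expected: str


# Hypothetical interface: a model takes a prompt and returns (output, cost_usd).
Model = Callable[[str], tuple[str, float]]


def run_eval(
    model: Model,
    examples: list[Example],
    is_well_formed: Callable[[str], bool],
) -> dict[str, float]:
    """Score one model on correctness, format reliability, latency and cost."""
    correct = well_formed = 0
    latency_total = cost_total = 0.0
    for example in examples:
        start = time.perf_counter()
        output, cost = model(example.prompt)
        latency_total += time.perf_counter() - start
        cost_total += cost
        # Exact match is a placeholder; swap in a task-specific check.
        correct += output.strip() == example.expected.strip()
        well_formed += is_well_formed(output)
    n = len(examples)
    return {
        "accuracy": correct / n,
        "format_ok": well_formed / n,
        "mean_latency_s": latency_total / n,
        "cost_per_task_usd": cost_total / n,
    }


def stub_model(prompt: str) -> tuple[str, float]:
    # Stand-in for a real API call: uppercases the prompt at a fixed cost.
    return prompt.upper(), 0.002


examples = [Example("ok", "OK"), Example("yes", "YES"), Example("no", "nope")]
report = run_eval(stub_model, examples, is_well_formed=str.isupper)
print(report)
```

Running the same `examples` list through one callable per candidate model gives a side-by-side report on the four dimensions above, which is the comparison the public leaderboards cannot make for you.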

And watch the trajectory, not just the snapshot. Anthropic ships point releases every few months, each nudging the benchmarks up and the effective price down. Measured against future releases, the Claude 4 score you see today is a floor, not a ceiling, which is exactly why the economics of building on these models keep improving even when the headline number barely moves.

Benchmarks build the shortlist. Your own eval makes the call.

Claude 4 leads agentic coding at 72–82% SWE-bench Verified and runs within about 5 points of GPT-5 and Gemini on most other evals, but the only score that decides your deployment is the one you run on your own data.

Track AI model economics, frontier-lab valuations, and the companies building on them on the AI Valuations dashboard and the Unicorns tracker at Value Add VC. Originally published in the Trace Cohen newsletter.

Anthropic Claude 4 Benchmark Performance: What the Scores Actually Mean for Buyers
