AI & Technology · June 13, 2026 · 11 min read · Last updated: June 13, 2026

OpenAI o3 Benchmark Results: What the Scores Mean and Where They Mislead

o3 posted the highest numbers OpenAI has ever published: 87.7% on GPQA, 71.7% on SWE-bench, 87.5% on ARC-AGI. But the headline scores hide a cost structure and a set of caveats that change how you should actually read them.

Trace Cohen
Co-Founder & GP at Six Point Ventures · 3x founder (BrandYourself, Launch.it, SPOT) · 65+ investments · Based in Boca Raton, FL

Quick Answer

87.7% on GPQA Diamond, 71.7% on SWE-bench Verified, 96.7% on AIME 2024, 25.2% on Frontier Math, and 87.5% on ARC-AGI in high-compute mode: those are OpenAI o3's headline benchmark results, the best the company had published at release. The catch is that the ARC-AGI breakthrough cost roughly $3,400 of compute per task, so the scores prove capability without proving it is economical to run at scale.

OpenAI o3 scored 87.7% on GPQA Diamond, 71.7% on SWE-bench Verified, 96.7% on AIME 2024, and a record 87.5% on ARC-AGI, but that last number cost roughly $3,400 of compute per task. That's the short answer. The longer answer is more interesting.

When OpenAI previewed o3, the benchmark slides looked like a generational leap. For most of them, it is. But a benchmark is a measurement under conditions, and o3's best conditions are not the ones you get when you call the API. Knowing which scores transfer to real work, and which are lab demonstrations, is the difference between a smart deployment and an expensive surprise.

OpenAI o3 benchmark results, by task

OpenAI o3 is a reasoning model that spends test-time compute "thinking" before it answers, and its benchmark results were the strongest OpenAI had published at release: 87.7% on GPQA Diamond, 71.7% on SWE-bench Verified, 96.7% on AIME 2024, 25.2% on EpochAI Frontier Math, and 87.5% on ARC-AGI in its high-compute configuration. Each measures a different capability, and they do not all transfer to production equally.

Benchmark                What it tests             o3       o1
GPQA Diamond             PhD-level science Q&A     87.7%    78%
SWE-bench Verified       Real GitHub bug fixes     71.7%    48.9%
AIME 2024                Competition mathematics   96.7%    83.3%
Codeforces (Elo)         Competitive programming   2727     1891
Frontier Math            Research-level math       25.2%    ~2%
ARC-AGI (low compute)    Abstract reasoning        75.7%    32%
ARC-AGI (high compute)   Abstract reasoning        87.5%    n/a

Figures are from OpenAI's o3 preview and the ARC Prize Foundation's independent evaluation. o1 figures are the comparable release-time numbers.
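One way to read the table is to look past the raw point gain and ask what share of o1's failures o3 removes. The sketch below does that arithmetic for the percentage rows, using only the scores in the table; the function names are illustrative, and Frontier Math is left out because its o1 figure is only approximate.

```python
# Scores copied from the table above, in percent, as (o3, o1).
# Codeforces is omitted because Elo is not a percentage.
SCORES = {
    "GPQA Diamond": (87.7, 78.0),
    "SWE-bench Verified": (71.7, 48.9),
    "AIME 2024": (96.7, 83.3),
    "ARC-AGI (low compute)": (75.7, 32.0),
}


def point_gain(o3: float, o1: float) -> float:
    """Absolute improvement in percentage points."""
    return o3 - o1


def failure_reduction(o3: float, o1: float) -> float:
    """Share of o1's failed tasks that o3 solves, in percent."""
    return 100.0 * (o3 - o1) / (100.0 - o1)


def summarize(scores: dict) -> dict:
    """Map each benchmark to (point gain, failure reduction), rounded to 0.1."""
    return {
        name: (round(point_gain(o3, o1), 1), round(failure_reduction(o3, o1), 1))
        for name, (o3, o1) in scores.items()
    }


if __name__ == "__main__":
    for name, (gain, cut) in summarize(SCORES).items():
        print(f"{name}: +{gain} points, {cut}% of o1's failures removed")
```

Read this way, the 13.4-point AIME gain removes about 80% of o1's misses, while the larger 22.8-point SWE-bench gain removes about 45%. The two views answer different questions, so it helps to look at both.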

What the o3 benchmark results actually mean

The single most important o3 result is SWE-bench Verified at 71.7%. This benchmark hands the model real, previously unseen issues from open-source repositories and checks whether its patch passes the project's own test suite. Jumping from o1's 48.9% to 71.7% means o3 resolves roughly seven of every ten of these tasks unaided: the gap between "helpful autocomplete" and "junior engineer who can close a ticket." For agentic coding products, that is the number that matters.

GPQA Diamond: 87.7%

Beats most PhD experts in their own field (typically ~70%). Real signal for research-assistant use cases.

AIME 2024: 96.7%

Near-saturated. So high it no longer separates frontier models; treat it as table stakes, not an edge.

Codeforces 2727 Elo

International grandmaster territory, ahead of most human competitive programmers. Strong but narrow.

Frontier Math: 25.2%

From ~2% to 25% is huge, but 75% of research-grade problems still fail. The honest ceiling marker.

The ARC-AGI score and the cost caveat the headlines skip

ARC-AGI is designed to resist memorization: each puzzle requires inferring a novel transformation rule from a handful of examples. No prior model had cracked it; GPT-4o sat near 5%, o1 around 32%. o3 posted 75.7% in a low-compute setting and 87.5% in a high-compute setting, the first time any system passed the ~85% human-level bar. That is a real milestone.

Then read the footnote. The ARC Prize Foundation reported that o3's high-compute run burned roughly $3,400 of compute per task, and the semi-private evaluation set behind the 87.5% score has 100 tasks. Even the low-compute configuration that scored 75.7% ran around $20 per task. A human solver costs a few dollars. So o3 proved the capability is reachable, not that it is affordable. When you call the o3 API at default settings, you are not getting the $3,400-per-task brain; you are getting a far cheaper configuration that scores meaningfully lower.

The benchmark-to-bill gap

A headline score is the model's ceiling under unlimited compute. Your invoice reflects the floor you can afford. With reasoning models, those two numbers can differ by 100x or more, which is why "o3 scored 87.5%" and "o3 will score 87.5% for you" are completely different claims.
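The ARC-AGI figures quoted above are enough to put rough numbers on that gap. The sketch below is back-of-envelope arithmetic, not a billing model: the per-task costs and accuracies come from the text, while the $5 human cost is an assumed stand-in for "a few dollars" and the function names are illustrative.

```python
# Per-task cost (USD) and accuracy (fraction) as quoted in the text above.
LOW = {"cost": 20.0, "accuracy": 0.757}
HIGH = {"cost": 3400.0, "accuracy": 0.875}

# ASSUMPTION: the article only says "a few dollars"; 5.0 is a placeholder.
HUMAN_COST = 5.0


def cost_ratio(expensive: float, cheap: float) -> float:
    """How many times more the expensive option costs per task."""
    return expensive / cheap


def cost_per_solved(config: dict) -> float:
    """Average spend per task actually solved, not per task attempted."""
    return config["cost"] / config["accuracy"]


def marginal_cost_per_extra_solve(cheap: dict, expensive: dict) -> float:
    """Extra spend per additional task solved by moving to the expensive mode."""
    extra_cost = expensive["cost"] - cheap["cost"]
    extra_accuracy = expensive["accuracy"] - cheap["accuracy"]
    return extra_cost / extra_accuracy


if __name__ == "__main__":
    print(f"high vs low compute: {cost_ratio(HIGH['cost'], LOW['cost']):.0f}x")
    print(f"high compute vs human: {cost_ratio(HIGH['cost'], HUMAN_COST):.0f}x")
    print(f"cost per solved task (low): ${cost_per_solved(LOW):,.2f}")
    print(f"cost per solved task (high): ${cost_per_solved(HIGH):,.2f}")
    extra = marginal_cost_per_extra_solve(LOW, HIGH)
    print(f"cost per extra solved task: ${extra:,.2f}")
```

On these inputs the high-compute mode costs 170 times the low-compute mode per task, and each additional task it solves beyond the low-compute mode works out to roughly $28,600. That marginal figure, more than the headline accuracy, is what a deployment decision would turn on.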

Where the o3 benchmark results fall short

Even taking the numbers at face value, four problems limit how much you should read into them:

Compute-mode ambiguity

OpenAI's slides mixed low- and high-compute results without always labeling which was which. A 'best' score you can't reproduce at normal cost isn't a product spec; it's a research result.

Saturation

AIME at 96.7% and GPQA near 88% are approaching the test ceiling. Once a benchmark saturates, it stops discriminating: a 1-point difference is noise, not capability.

Contamination risk

Public benchmarks leak into training data over time. High scores on well-known sets are always partly a question of how much of the test the model has effectively already seen.

Real-world transfer

SWE-bench tasks have clean test suites that declare success. Production work is ambiguous, under-specified, and multi-step, exactly where benchmark-topping models still stumble and need human review.
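The saturation point above can be made concrete with a rough noise estimate. The sketch below treats each benchmark as a set of independent pass/fail questions and computes how many points one question is worth, plus a simple binomial standard error. The question counts are assumptions not stated in this article (30 for the two AIME 2024 papers combined, 198 for GPQA Diamond), and the normal approximation is crude this close to the ceiling, so read the output as a gauge rather than a confidence interval.

```python
import math

# ASSUMPTION: question counts are not given in the article. 30 (AIME I + II)
# and 198 (GPQA Diamond) are the commonly cited sizes; adjust if they differ.
BENCHMARKS = {
    "AIME 2024": {"score": 96.7, "questions": 30},
    "GPQA Diamond": {"score": 87.7, "questions": 198},
}


def points_per_question(questions: int) -> float:
    """Percentage points gained or lost by flipping a single answer."""
    return 100.0 / questions


def standard_error_points(score: float, questions: int) -> float:
    """Binomial standard error of the score, in percentage points."""
    p = score / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / questions)


if __name__ == "__main__":
    for name, b in BENCHMARKS.items():
        step = points_per_question(b["questions"])
        se = standard_error_points(b["score"], b["questions"])
        print(f"{name}: one question = {step:.2f} points, standard error = {se:.2f} points")
```

Under those assumptions a single AIME question is worth about 3.3 points, and the standard error on both benchmarks exceeds 2 points, which is why a 1-point difference between two models on these tests says little on its own.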

o3 vs o1 vs o3-mini: which one should you actually run?

The benchmark gap between o3 and o1 is real and large: a nearly 23-point SWE-bench jump is not marketing. But for most teams the relevant comparison isn't o3 versus o1; it's o3 versus o3-mini versus a non-reasoning model like GPT-4o. o3-mini captures the bulk of the reasoning gains at a fraction of the latency and token cost, which is why it became the default reasoning workhorse for production. Reserve full o3 for the genuinely hard cases (research-grade analysis, complex multi-file refactors, agentic loops) where the extra accuracy clears the 10x-plus cost premium. For anything routine, the cheaper model is the rational choice, and the benchmark crown is irrelevant to your unit economics.
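Whether the extra accuracy "clears the cost premium" depends on what a failed task costs you. A minimal sketch of that trade-off follows, assuming each failure is caught and handed to a fallback (for example, human review) at a fixed price. Every number and name in it is hypothetical; none of them are published prices or measured success rates for o3 or o3-mini.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOption:
    name: str
    cost_per_task: float  # USD per attempt (hypothetical)
    success_rate: float   # fraction solved on your own workload (hypothetical)


def expected_cost(option: ModelOption, failure_cost: float) -> float:
    """Model spend plus the expected cost of handling its failures."""
    return option.cost_per_task + (1.0 - option.success_rate) * failure_cost


def cheapest(options: list, failure_cost: float) -> ModelOption:
    """Pick the option with the lowest expected cost per task."""
    return min(options, key=lambda o: expected_cost(o, failure_cost))


def break_even_failure_cost(cheap: ModelOption, premium: ModelOption) -> float:
    """Failure cost above which the premium model becomes the cheaper choice."""
    extra_cost = premium.cost_per_task - cheap.cost_per_task
    extra_success = premium.success_rate - cheap.success_rate
    return extra_cost / extra_success


if __name__ == "__main__":
    # Hypothetical pair with a 10x price gap and a 10-point accuracy gap.
    small = ModelOption("small-reasoning", cost_per_task=0.05, success_rate=0.60)
    large = ModelOption("large-reasoning", cost_per_task=0.50, success_rate=0.70)
    print(f"break-even failure cost: ${break_even_failure_cost(small, large):.2f}")
    for failure_cost in (1.0, 20.0):
        pick = cheapest([small, large], failure_cost)
        print(f"failure costs ${failure_cost:.2f}: use {pick.name}")
```

With these placeholder numbers the larger model only pays for itself once a failure costs more than about $4.50 to clean up. Below that threshold the cheaper model wins despite the lower success rate, which is the routine-work case described above.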

This is the same dynamic reshaping AI startup pricing and valuations more broadly: capability is racing ahead while cost-per-useful-task becomes the real battleground. We track how those model economics flow into company pricing on the AI Valuations dashboard.

o3's benchmarks prove the frontier moved.

But a score you can't afford to reproduce is a research result, not a product spec. Read every o3 number with its compute mode and cost attached.

Track AI model economics and valuations on the AI Valuations Dashboard at Value Add VC. Originally published in the Trace Cohen newsletter.

Frequently Asked Questions

What are OpenAI o3's benchmark results?

OpenAI o3 scores 87.7% on GPQA Diamond (PhD-level science), 71.7% on SWE-bench Verified (real software fixes), 96.7% on the 2024 AIME math competition, 25.2% on EpochAI's Frontier Math, and a breakthrough 87.5% on the ARC-AGI abstraction benchmark in its high-compute setting. Those are the highest published scores from any OpenAI model at release, but several came in expensive test-time-compute configurations that don't match the default API model.

How does o3 compare to o1 on benchmarks?

o3 beats o1 across every major benchmark. On SWE-bench Verified o3 hit 71.7% versus o1's 48.9%, a jump of nearly 23 points. On the AIME 2024 math exam o3 reached 96.7% versus o1's 83.3%, and on GPQA Diamond o3 posted 87.7% against o1's 78%. The largest gap is on ARC-AGI, where o3's 87.5% high-compute score dwarfs o1's 32%.

What was o3's ARC-AGI score and why is it controversial?

o3 scored 75.7% on ARC-AGI in a low-compute setting and 87.5% in a high-compute setting, the first model to clear the 85% human-level threshold. The controversy is cost: the high-compute run consumed roughly $3,400 of compute per task, hundreds of times the few dollars a human solver would cost, so the score proves capability exists but not that it is economical to deploy.

Where does o3 fall short on benchmarks?

o3 still scores only 25.2% on Frontier Math, meaning roughly three-quarters of research-level math problems remain unsolved. Its strongest results often come from high-compute modes that cost orders of magnitude more than standard inference, benchmark contamination is hard to rule out, and saturated tests like AIME (96.7%) no longer discriminate between top models. High scores also do not guarantee reliability on messy real-world tasks.

Is o3 worth using over o3-mini or GPT-4o for most tasks?

For most production work, no. o3-mini delivers most of the reasoning gains at a fraction of the cost and latency, and GPT-4o is faster and cheaper still for non-reasoning tasks. o3's benchmark-topping configurations are best reserved for genuinely hard reasoning, research, or agentic coding problems where the extra accuracy justifies 10x or higher token costs.
