OpenAI o3 scored 87.7% on GPQA Diamond, 71.7% on SWE-bench Verified, 96.7% on AIME 2024, and a record 87.5% on ARC-AGI, but that last number cost roughly $3,400 of compute per task. That's the short answer. The longer answer is more interesting.
When OpenAI previewed o3, the benchmark slides looked like a generational leap. Most of the results are one. But a benchmark is a measurement under conditions, and o3's best conditions are not the ones you get when you call the API. Knowing which scores transfer to real work, and which are lab demonstrations, is the difference between a smart deployment and an expensive surprise.
OpenAI o3 benchmark results, by task
OpenAI o3 is a reasoning model that spends test-time compute "thinking" before it answers, and its benchmark results were the strongest OpenAI had published at release: 87.7% on GPQA Diamond, 71.7% on SWE-bench Verified, 96.7% on AIME 2024, 25.2% on EpochAI Frontier Math, and 87.5% on ARC-AGI in its high-compute configuration. Each measures a different capability, and they do not all transfer to production equally.
| Benchmark | What it tests | o3 | o1 |
|---|---|---|---|
| GPQA Diamond | PhD-level science Q&A | 87.7% | 78% |
| SWE-bench Verified | Real GitHub bug fixes | 71.7% | 48.9% |
| AIME 2024 | Competition mathematics | 96.7% | 83.3% |
| Codeforces (Elo) | Competitive programming | 2727 | 1891 |
| Frontier Math | Research-level math | 25.2% | ~2% |
| ARC-AGI (low compute) | Abstract reasoning | 75.7% | 32% |
| ARC-AGI (high compute) | Abstract reasoning | 87.5% | n/a |
Figures are from OpenAI's o3 preview and the ARC Prize Foundation's independent evaluation. o1 figures are the comparable release-time numbers.
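The size of each jump is easier to compare with the arithmetic written out. The sketch below uses only scores from the table; the helper name `gains` and the "share of o1's errors removed" framing are illustrative choices of this sketch, not metrics OpenAI reports.

```python
# Scores copied from the table above, in percent: (o3, o1).
scores = {
    "GPQA Diamond": (87.7, 78.0),
    "SWE-bench Verified": (71.7, 48.9),
    "AIME 2024": (96.7, 83.3),
    "ARC-AGI (low compute)": (75.7, 32.0),
}


def gains(o3: float, o1: float) -> tuple[float, float]:
    """Return (point gain, share of o1's remaining errors that o3 removed)."""
    point_gain = o3 - o1
    error_reduction = point_gain / (100.0 - o1)
    return round(point_gain, 1), round(error_reduction, 3)


for name, (o3, o1) in scores.items():
    points, share = gains(o3, o1)
    print(f"{name}: +{points} points, {share:.0%} of o1's errors removed")
```

Read this way, the AIME gain removes most of o1's remaining errors on a test that is close to its ceiling, while the SWE-bench gain of 22.8 points removes a bit under half of them on a test with far more headroom.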
What the o3 benchmark results actually mean
The single most important o3 result is SWE-bench Verified at 71.7%. This benchmark hands the model real, previously-unseen issues from open-source repositories and checks whether its patch passes the project's own test suite. Jumping from o1's 48.9% to 71.7% means o3 resolves roughly seven of every ten of these real-world issues unaided: the gap between "helpful autocomplete" and "junior engineer who can close a ticket." For agentic coding products, that is the number that matters.
GPQA Diamond: 87.7%
Above the roughly 70% that PhD-level experts typically score in their own field. Real signal for research-assistant use cases.
AIME 2024: 96.7%
Near-saturated. So high it no longer separates frontier models; treat it as table stakes, not an edge.
Codeforces: 2727 Elo
International grandmaster territory, ahead of most human competitive programmers. Strong but narrow.
Frontier Math: 25.2%
The jump from ~2% to 25% is huge, but o3 still fails roughly 75% of research-grade problems. The honest ceiling marker.
The ARC-AGI score and the cost caveat the headlines skip
ARC-AGI is designed to resist memorization: each puzzle requires inferring a novel transformation rule from a handful of examples. No prior model had cracked it; GPT-4o sat near 5%, o1 around 32%. o3 posted 75.7% in a low-compute setting and 87.5% in a high-compute setting, the first time any system passed the ~85% human-level bar. That is a real milestone.
Then read the footnote. The ARC Prize Foundation reported that o3's high-compute run burned roughly $3,400 of compute per task, and there were 400 tasks. Even the low-compute configuration that scored 75.7% ran around $20 per task. A human solver costs a few dollars. So o3 proved the capability is reachable, not that it is affordable. When you call the o3 API at default settings, you are not getting the $3,400-per-task brain; you are getting a far cheaper configuration that scores meaningfully lower.
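To put those per-task figures in run-level terms, here is a rough back-of-the-envelope sketch. It uses only the approximate numbers quoted above (400 tasks, about $3,400 and $20 per task, 87.5% and 75.7%); the variable and function names are illustrative, and the results are only as reliable as those estimates.

```python
# Approximate figures as quoted in the text above (USD and fractions).
TASKS = 400
HIGH_COST_PER_TASK = 3_400  # high-compute configuration
LOW_COST_PER_TASK = 20      # low-compute configuration
HIGH_SCORE = 0.875
LOW_SCORE = 0.757


def run_cost(cost_per_task: float, tasks: int = TASKS) -> float:
    """Total compute spend for one pass over the task set."""
    return cost_per_task * tasks


def cost_per_solved(cost_per_task: float, score: float) -> float:
    """Average spend per task that was actually solved."""
    return cost_per_task / score


high_total = run_cost(HIGH_COST_PER_TASK)
low_total = run_cost(LOW_COST_PER_TASK)
cost_ratio = HIGH_COST_PER_TASK / LOW_COST_PER_TASK

# Marginal view: what each *additional* solved task cost in high-compute mode.
extra_solved = (HIGH_SCORE - LOW_SCORE) * TASKS
marginal_cost = (high_total - low_total) / extra_solved

print(f"High-compute run: ${high_total:,.0f}")
print(f"Low-compute run:  ${low_total:,.0f}")
print(f"Cost ratio:       {cost_ratio:.0f}x")
print(f"Each extra solved task cost about ${marginal_cost:,.0f}")
```

On these figures the high-compute run comes to about $1.36 million against about $8,000 for the low-compute run, a 170x difference for 11.8 more points, or roughly $28,600 for each additional puzzle solved.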
The benchmark-to-bill gap
A headline score is the model's ceiling under unlimited compute. Your invoice reflects the floor you can afford. With reasoning models, those two numbers can differ by 100x or more, which is why "o3 scored 87.5%" and "o3 will score 87.5% for you" are completely different claims.
Where the o3 benchmark results fall short
Even taking the numbers at face value, four problems limit how much you should read into them:
Compute-mode ambiguity
OpenAI's slides mixed low- and high-compute results without always labeling which was which. A 'best' score you can't reproduce at normal cost isn't a product spec; it's a research result.
Saturation
AIME at 96.7% and GPQA near 88% are approaching the test ceiling. Once a benchmark saturates, it stops discriminating: a 1-point difference is noise, not capability.
Contamination risk
Public benchmarks leak into training data over time. High scores on well-known sets are always partly a question of how much of the test the model has effectively already seen.
Real-world transfer
SWE-bench tasks have clean test suites that declare success. Production work is ambiguous, under-specified, and multi-step, exactly where benchmark-topping models still stumble and need human review.
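The saturation point above, that a 1-point difference is noise, can be checked with a simple binomial standard error. This is a minimal sketch under loud assumptions: the question counts (30 for AIME 2024, 198 for GPQA Diamond) are my understanding of the public benchmark sizes and are not stated in this article, each question is treated as an independent pass/fail trial, and the normal approximation is crude this close to the ceiling. Published scores may also be averaged over several samples, which this ignores.

```python
import math

# Assumed question counts; not taken from the article.
ASSUMED_QUESTIONS = {
    "AIME 2024": 30,
    "GPQA Diamond": 198,
}

# o3 scores as reported in the article, as fractions.
REPORTED_SCORES = {
    "AIME 2024": 0.967,
    "GPQA Diamond": 0.877,
}


def standard_error(score: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy measured on n_questions items."""
    return math.sqrt(score * (1.0 - score) / n_questions)


for name, score in REPORTED_SCORES.items():
    se_points = 100.0 * standard_error(score, ASSUMED_QUESTIONS[name])
    print(f"{name}: {score:.1%} with a standard error of about {se_points:.1f} points")
```

Under these assumptions the standard error is about 3.3 points for AIME and about 2.3 points for GPQA Diamond, so a 1-point gap between two models sits well inside a single standard error on either test.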
o3 vs o1 vs o3-mini: which one should you actually run?
The benchmark gap between o3 and o1 is real and large: a nearly 23-point SWE-bench jump is not marketing. But for most teams the relevant comparison isn't o3 versus o1; it's o3 versus o3-mini versus a non-reasoning model like GPT-4o. o3-mini captures the bulk of the reasoning gains at a fraction of the latency and token cost, which is why it became the default reasoning workhorse for production. Reserve full o3 for the genuinely hard cases (research-grade analysis, complex multi-file refactors, agentic loops) where the extra accuracy clears the 10x-plus cost premium. For anything routine, the cheaper model is the rational choice, and the benchmark crown is irrelevant to your unit economics.
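One way to make "clears the cost premium" concrete is to compare expected cost per task, counting what it costs when the model fails and a person has to step in. The sketch below is illustrative only: the model names, prices, and success rates are hypothetical placeholders rather than measured o3 or o3-mini figures, and the only thing taken from the text is the roughly 10x price gap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOption:
    name: str
    cost_per_attempt: float  # USD per task, hypothetical
    success_rate: float      # fraction of tasks solved, hypothetical


def expected_cost(option: ModelOption, fallback_cost: float) -> float:
    """Model spend plus the expected cost of a human handling the failures."""
    return option.cost_per_attempt + (1.0 - option.success_rate) * fallback_cost


def pick_model(options: list[ModelOption], fallback_cost: float) -> ModelOption:
    """Choose the option with the lowest expected cost per task."""
    return min(options, key=lambda o: expected_cost(o, fallback_cost))


def break_even_fallback(cheap: ModelOption, premium: ModelOption) -> float:
    """Fallback cost above which the premium model is cheaper overall."""
    return (premium.cost_per_attempt - cheap.cost_per_attempt) / (
        premium.success_rate - cheap.success_rate
    )


# Hypothetical numbers: a 10x price gap and a 17-point accuracy gap.
cheap = ModelOption("small-reasoning-model", 0.10, 0.55)
premium = ModelOption("large-reasoning-model", 1.00, 0.72)

for fallback in (2.0, 50.0):
    choice = pick_model([cheap, premium], fallback)
    print(f"Failure costs ${fallback:.0f} to fix: use {choice.name}")

print(f"Break-even failure cost: ${break_even_fallback(cheap, premium):.2f}")
```

With these placeholder numbers the cheaper model wins whenever a failure costs less than about $5.29 to fix, and the expensive one wins above that. The pattern matches the argument: routine tasks with cheap failures favor the small model, while hard tasks with costly failures can justify the premium.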
This is the same dynamic reshaping AI startup pricing and valuations more broadly: capability is racing ahead while cost-per-useful-task becomes the real battleground. We track how those model economics flow into company pricing on the AI Valuations dashboard.
o3's benchmarks prove the frontier moved.
But a score you can't afford to reproduce is a research result, not a product spec; read every o3 number with its compute mode and cost attached.
Track AI model economics and valuations on the AI Valuations Dashboard at Value Add VC. Originally published in the Trace Cohen newsletter.