GPT-5 scores roughly 89% on MMLU and a reported 74% on SWE-bench Verified โ up from GPT-4o's ~85% and ~33% โ while input pricing actually dropped to about $1.25 per million tokens. That's the short answer. The longer answer is more interesting.
The headline isn't any single benchmark. It's that OpenAI stopped making you choose between a fast chat model and a slow reasoning model. GPT-5 is one system that decides, per query, how long to think โ and that single design change is what reshapes how teams actually use it.
The OpenAI GPT-5 release: what shipped and when
OpenAI released GPT-5 in August 2025, about 18 months after GPT-4o launched in May 2024. It became the default model in ChatGPT for Plus and Pro subscribers immediately, shipped through the API the same week, and reached free users at a rate-limited tier within days. Unlike the GPT-4 to GPT-4o transition, this was not a speed-and-cost refresh of the same capability โ it was a genuine architecture change that folded the o-series reasoning line and the GPT chat line into a single model with an internal router.
In practice that means a trivial request ("rewrite this sentence") returns in well under a second, while a hard request ("debug this 400-line module") triggers extended thinking that can run 10โ60 seconds before answering. You no longer manually select o3 versus 4o. The model picks, and you can override with a reasoning-effort parameter in the API.
GPT-5 vs GPT-4o: the benchmark and pricing breakdown
The clearest way to see the GPT-5 release's impact is a direct comparison against GPT-4o across the metrics that matter for production use. Reported figures vary by configuration, but the directional gaps are consistent across independent evaluations.
| Attribute | GPT-4o | GPT-5 |
|---|---|---|
| MMLU (general knowledge) | ~85% | ~89% |
| SWE-bench Verified (coding) | ~33% | ~74% |
| AIME math (competition) | ~13% | ~94% (with tools) |
| Input price / 1M tokens | ~$2.50 | ~$1.25 |
| Output price / 1M tokens | ~$10 | ~$10 |
| Context window | 128K tokens | 256Kโ400K tokens |
| Reasoning mode | None (instant only) | Adaptive (auto-routes) |
| Hallucination rate (factual Q&A) | Baseline | ~45โ80% lower |
The single most consequential row is SWE-bench Verified: jumping from ~33% to ~74% means GPT-5 resolves more than twice as many real GitHub issues end-to-end. That's the difference between an assistant you babysit and one you can hand a ticket.
What GPT-5 can actually do that GPT-4o couldn't
Long-horizon agentic tasks
Chains 20+ tool calls without losing the plot โ GPT-4o drifted after a handful of steps
Production-grade coding
~74% SWE-bench Verified means it closes real issues, not just autocompletes
Adaptive reasoning depth
Spends compute only on hard queries, keeping average latency and cost down
Fewer confident wrong answers
Reported 45โ80% lower hallucination on factual queries vs GPT-4o
Larger context
256Kโ400K tokens lets it hold an entire codebase or contract in one pass
Cheaper input
~$1.25/M input is half of GPT-4o, so RAG and long-prompt workloads cost less
Who the GPT-5 release is actually for
Not everyone needs to migrate. The model tiers โ GPT-5, gpt-5-mini, and gpt-5-nano โ exist precisely because the right answer depends on the workload. Here's how I'd route real use cases:
Worth the upgrade
- โ Coding agents and dev tooling (the ~40-point SWE-bench jump is decisive)
- โ Multi-step research and analysis where errors compound
- โ Customer-facing agents that must not hallucinate
- โ Long-document workflows that need 256K+ context
Stick with 4o or mini
- โ High-volume classification and tagging
- โ Simple chat and FAQ responders
- โ Short summarization at scale
- โ Latency-critical paths where a second matters
The cheaper input price changes the math for retrieval-heavy products. If you were paying $2.50 per million input tokens on GPT-4o to stuff context into prompts, halving that to ~$1.25 on GPT-5 โ while getting better answers โ is a rare free lunch. For founders pricing AI features into a product, this is exactly the kind of unit-economics shift that moves a feature from margin-negative to viable. It's also why OpenAI's reported run-rate keeps climbing; see the AI Valuations dashboard for how that revenue feeds the model's ~$300B+ valuation.
How GPT-5 stacks up against Claude and Gemini
The GPT-5 release didn't open a clear gap over the rest of the frontier โ it closed one OpenAI had been losing on coding. On SWE-bench Verified, GPT-5's ~74% sits in the same band as Anthropic's frontier Claude models (low-to-mid 70s) and ahead of most Gemini configurations (high 60s). On price, GPT-5's ~$1.25/M input undercuts Claude's comparable tier and roughly matches Gemini. The honest read in mid-2026: the three labs trade the lead benchmark-by-benchmark, and the deciding factors are tooling, ecosystem, and how each model behaves on your specific workload โ not a single leaderboard number.
That convergence is itself the story. Two years ago a 30-point gap separated the best model from the pack. Today the frontier is bunched within a few points on most benchmarks, which is why the AI investment thesis has quietly shifted from "who has the best model" to "who has distribution, data, and switching costs."
The GPT-5 release isn't about one benchmark beating GPT-4o.
It's that reasoning got cheap enough and automatic enough to put in every product โ and that, not the leaderboard, is what changes the market.
Track AI model economics and valuations on the AI Valuations Dashboard at Value Add VC. Originally published in the Trace Cohen newsletter.