The release of o3 and o4-mini wasn't just another model upgrade. It was OpenAI announcing that the era of "scale the pretraining" is being supplemented by something new: scale the thinking.
Reasoning models don't just predict the next token faster. They spend compute at inference time (thinking through subproblems, backtracking, self-correcting) before producing an answer. This is a fundamentally different architecture for capability, and it has real consequences for how you build AI applications, what you pay, and which problems are now solvable.
What the OpenAI o3 Reasoning Model Actually Does
o3 uses an extended internal chain-of-thought. When you send a hard problem, the model generates reasoning tokens, essentially a scratchpad, before producing the final answer. You don't see the thinking, but you pay for it. This means o3 can solve problems that stump o1 and GPT-4o not because it has more parameters, but because it gets more time to think.
| Metric | o3 | o4-mini | o1 | GPT-4o |
|---|---|---|---|---|
| ARC-AGI (novel reasoning) | 87.5% | ~70% | 32% | ~5% |
| GPQA Diamond (PhD science) | ~88% | ~80% | 78% | 53% |
| AIME 2024 (math competition) | 96.7% | 93% | 83% | 13% |
| SWE-bench (software eng.) | 71% | 68% | 49% | 46% |
| Input cost ($/M tokens) | $10 | $1.10 | $15 | $2.50 |
Sources: OpenAI technical reports, independent benchmark evaluations, mid-2026 API pricing.
The o4-mini Case: When Good Enough Is the Right Call
o4-mini was the underrated release. At $1.10/M input tokens, it delivers reasoning-model-grade performance on most practical tasks (coding, math, structured analysis, multi-step workflows) without the cost structure that makes o3 impractical at scale.
For most AI application builders in 2026, o4-mini is the right default for reasoning-heavy features. The cost savings are real: at the listed input prices, a product processing 100 billion input tokens per month pays about $110K on o4-mini versus $1M on o3. On the SWE-bench software engineering benchmark, the delta is 3 percentage points (68% vs 71%). Most companies can't justify paying roughly 9x for a 3-point improvement in task completion.
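The arithmetic behind this comparison can be sketched in a few lines. This is a rough illustration, not a billing tool: the prices come from the table above, the function name `monthly_input_cost` is made up for this example, and the volume is a parameter you would replace with your own. At these prices, the $110K and $1M figures correspond to roughly 100 billion input tokens per month.

```python
# Input prices in USD per million tokens, taken from the table above.
INPUT_PRICE_PER_M = {"o3": 10.00, "o4-mini": 1.10}


def monthly_input_cost(model: str, tokens_per_month: int) -> float:
    """Return the monthly input-token bill in USD for a given volume."""
    return INPUT_PRICE_PER_M[model] * tokens_per_month / 1_000_000


if __name__ == "__main__":
    volume = 100_000_000_000  # illustrative: 100 billion input tokens per month
    for model in INPUT_PRICE_PER_M:
        print(f"{model}: ${monthly_input_cost(model, volume):,.0f} per month")
    ratio = INPUT_PRICE_PER_M["o3"] / INPUT_PRICE_PER_M["o4-mini"]
    print(f"o3 input tokens cost {ratio:.1f}x as much as o4-mini")
```

Note that this covers input tokens only; output and reasoning tokens are billed separately.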
The Hidden Cost: Reasoning Tokens
The headline API price understates the real cost. Reasoning models generate thinking tokens (internal scratchpad content) that are billed at the output token rate. On complex tasks, reasoning tokens can represent 20-50% of total token usage, and they're invisible to you in the response.
A query that produces 500 visible output tokens might consume 1,500 reasoning tokens behind the scenes. At o3's $40/M output rate, those hidden tokens add $0.06 per query, or $60,000 per million queries, on top of what the input price alone suggests. Teams building on o3 at scale need to instrument reasoning token consumption carefully: it's the number that actually blows up your inference budget.
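One way to instrument this is to accumulate reasoning tokens from each response's usage data. The sketch below is a minimal version under stated assumptions: it expects a usage payload shaped like `{"completion_tokens": ..., "completion_tokens_details": {"reasoning_tokens": ...}}`, where `completion_tokens` already includes the hidden reasoning tokens. Check those field names against your provider's API reference before relying on them. The class name `ReasoningUsageTracker` is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ReasoningUsageTracker:
    """Accumulates billed output tokens and their hidden reasoning portion."""

    reasoning_tokens: int = 0
    completion_tokens: int = 0
    requests: int = 0

    def record(self, usage: dict) -> None:
        # Assumed payload shape (verify against your API's documentation):
        # completion_tokens counts visible + reasoning tokens, and the hidden
        # part is reported under completion_tokens_details.reasoning_tokens.
        details = usage.get("completion_tokens_details") or {}
        self.reasoning_tokens += details.get("reasoning_tokens", 0)
        self.completion_tokens += usage.get("completion_tokens", 0)
        self.requests += 1

    @property
    def reasoning_share(self) -> float:
        """Fraction of billed output tokens that were hidden reasoning."""
        if self.completion_tokens == 0:
            return 0.0
        return self.reasoning_tokens / self.completion_tokens


if __name__ == "__main__":
    tracker = ReasoningUsageTracker()
    # The example from the text: 500 visible tokens plus 1,500 reasoning tokens.
    tracker.record({
        "completion_tokens": 2000,
        "completion_tokens_details": {"reasoning_tokens": 1500},
    })
    print(f"reasoning share of output: {tracker.reasoning_share:.0%}")
```

Logging this share per feature or per endpoint makes it easier to spot which workloads are driving the bill.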
o4-mini mitigates this significantly. Lower output token prices ($4.40/M) mean that even with 50% reasoning overhead, the effective cost stays manageable. This is one of the most underappreciated reasons o4-mini is the better production choice for most applications.
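To make the per-query math concrete, here is a small sketch using the output prices quoted in this section ($40/M for o3, $4.40/M for o4-mini) and the 500 visible / 1,500 reasoning token example. The function names are made up for illustration, and it assumes reasoning tokens are billed at exactly the output rate, as described above.

```python
# Output prices in USD per million tokens, as quoted in this section.
OUTPUT_PRICE_PER_M = {"o3": 40.00, "o4-mini": 4.40}


def hidden_cost_per_query(model: str, reasoning_tokens: int) -> float:
    """Cost in USD of the reasoning tokens you pay for but never see."""
    return OUTPUT_PRICE_PER_M[model] * reasoning_tokens / 1_000_000


def output_cost_per_query(model: str, visible_tokens: int,
                          reasoning_tokens: int) -> float:
    """Total output-side cost in USD: visible plus reasoning tokens."""
    billed_tokens = visible_tokens + reasoning_tokens
    return OUTPUT_PRICE_PER_M[model] * billed_tokens / 1_000_000


if __name__ == "__main__":
    visible, reasoning = 500, 1_500
    for model in OUTPUT_PRICE_PER_M:
        hidden = hidden_cost_per_query(model, reasoning)
        total = output_cost_per_query(model, visible, reasoning)
        print(f"{model}: hidden ${hidden:.4f}, total output ${total:.4f} "
              f"per query (${hidden * 1_000_000:,.0f} hidden per 1M queries)")
```

With these numbers the hidden reasoning tokens cost about $0.06 per query on o3 and under one cent on o4-mini.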
What This Means for AI Startups and Application Builders
Three things shift when reasoning models become the baseline:
The floor for "capable enough" just rose dramatically
Problems that required multi-agent orchestration or custom fine-tuning 18 months ago are now solvable with a single o4-mini API call. This compresses the competitive moat of AI infrastructure startups and raises the bar for what "good" looks like to enterprise buyers.
Prompt engineering matters less; task specification matters more
Reasoning models are better at inferring intent from underspecified prompts. The differentiating skill is now defining the right task boundary (what the model should and shouldn't try to reason through), not crafting elaborate prompts to compensate for model limitations.
Latency becomes the new constraint
o3 thinking on a hard problem can take 30-120 seconds. That's fine for asynchronous workflows; it's unusable for real-time UX. The architecture split between fast-path (GPT-4o) and deep-reasoning (o3/o4-mini) is becoming a standard design pattern in production AI systems.
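A fast-path / deep-reasoning split might look something like the sketch below. It is a simplified illustration, not a production router: the `Task` fields, the `choose_model` function, and the 30-second threshold (borrowed from the low end of the thinking time quoted above) are all assumptions, and deciding whether a task "needs deep reasoning" is left to whatever classifier or heuristic you run upstream.

```python
from dataclasses import dataclass

FAST_MODEL = "gpt-4o"        # low-latency path
REASONING_MODEL = "o4-mini"  # deep-reasoning path (swap in "o3" if needed)

# Assumed threshold: reasoning calls may take this long or longer.
REASONING_MIN_LATENCY_S = 30.0


@dataclass
class Task:
    prompt: str
    needs_deep_reasoning: bool  # hypothetical flag set by upstream logic
    latency_budget_s: float     # how long the caller can wait for an answer


def choose_model(task: Task) -> str:
    """Route to the reasoning model only when the caller can afford the wait."""
    can_wait = task.latency_budget_s >= REASONING_MIN_LATENCY_S
    if task.needs_deep_reasoning and can_wait:
        return REASONING_MODEL
    # Real-time UX and simple tasks stay on the fast path.
    return FAST_MODEL


if __name__ == "__main__":
    chat_turn = Task("Summarize this email", False, 2.0)
    batch_job = Task("Find the bug in this module", True, 300.0)
    print(choose_model(chat_turn))
    print(choose_model(batch_job))
```

In practice a hard task with a tight latency budget is usually better handled by queueing it for the reasoning model and notifying the user later, rather than silently falling back to the fast path as this sketch does.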
The Investor Angle: What Reasoning Models Do to AI Valuations
I track AI company valuations at the AI Valuations dashboard and the trend is clear: reasoning model capability is being priced as a category separator, not a feature. OpenAI crossed $10B ARR in early 2026, the fastest SaaS company to that milestone, and the o-series models are the primary driver of enterprise upsell.
For startups, the implication is straightforward: if your product's value proposition is "we use AI to do X," and X is now achievable with an o4-mini API call, your moat is thinner than it was 12 months ago. The startups holding their ground are the ones with proprietary data, workflow lock-in, and vertical depth that can't be replicated by upgrading a model.
Conversely, if you're building in a domain where reasoning model capability unlocks a genuinely new workflow, one that required humans before, you're in the most interesting part of the market right now. Legal, scientific research, complex financial modeling, and engineering design are all seeing that shift in real time.
The model is no longer the product differentiator.
The data, workflow integration, and domain depth are. o3 and o4-mini just raised the floor; the ceiling is still yours to build.
Track AI company valuations and frontier model pricing at the AI Valuations Dashboard and AI Spending Dashboard at Value Add VC. Originally published in the Trace Cohen newsletter.