AI & Technology · May 28, 2026 · 8 min read · Last updated: May 28, 2026

OpenAI o3 and o4 Mini: What the New Reasoning Models Mean for AI Applications

Reasoning models don't just generate text faster; they think before they answer. o3 scores 87.5% on ARC-AGI and approaches PhD-level science performance. o4-mini gets you most of that at 9x lower API cost. Here's what this shift means for every team building on AI.

Trace Cohen
3x founder, 65+ investments, building Value Add VC

Quick Answer

OpenAI o3 is a reasoning model that scores 87.5% on ARC-AGI and ~88% on GPQA Diamond (PhD-level science), using test-time compute to think through hard problems before answering. o4-mini delivers 80–90% of o3's capability at $1.10/M input tokens vs $10/M for o3, making reasoning-grade AI economically viable for production applications.

The release of o3 and o4-mini wasn't just another model upgrade. It was OpenAI announcing that the era of "scale the pretraining" is being supplemented by something new: scale the thinking.

Reasoning models don't just predict the next token faster. They spend compute at inference time (thinking through subproblems, backtracking, self-correcting) before producing an answer. This is a fundamentally different architecture for capability, and it has real consequences for how you build AI applications, what you pay, and which problems are now solvable.

What the OpenAI o3 Reasoning Model Actually Does

o3 uses an extended internal chain-of-thought. When you send a hard problem, the model generates reasoning tokens, essentially a scratchpad, before producing the final answer. You don't see the thinking, but you pay for it. This means o3 can solve problems that stump o1 and GPT-4o not because it has more parameters, but because it gets more time to think.

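| Benchmark | o3 | o4-mini | o1 | GPT-4o |
|---|---|---|---|---|
| ARC-AGI (novel reasoning) | 87.5% | ~70% | 32% | ~5% |
| GPQA Diamond (PhD science) | ~88% | ~80% | 78% | 53% |
| AIME 2024 (math competition) | 96.7% | 93% | 83% | 13% |
| SWE-bench (software eng.) | 71% | 68% | 49% | 46% |
| Input cost ($/M tokens) | $10 | $1.10 | $15 | $2.50 |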

Sources: OpenAI technical reports, independent benchmark evaluations, mid-2026 API pricing.
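
The relative gap between the two reasoning models is easy to check with a few lines of arithmetic. The sketch below uses only the o3 and o4-mini benchmark figures quoted in the table (several of which are approximate) and computes o4-mini's score as a fraction of o3's. The dictionary and function names are illustrative, not part of any library.

```python
# Benchmark scores in percent, copied from the table above.
# Values marked "~" in the table are approximate, so the ratios are too.
SCORES = {
    "ARC-AGI": {"o3": 87.5, "o4-mini": 70.0},
    "GPQA Diamond": {"o3": 88.0, "o4-mini": 80.0},
    "AIME 2024": {"o3": 96.7, "o4-mini": 93.0},
    "SWE-bench": {"o3": 71.0, "o4-mini": 68.0},
}


def relative_score(benchmark: str) -> float:
    """Return o4-mini's score as a fraction of o3's on one benchmark."""
    row = SCORES[benchmark]
    return row["o4-mini"] / row["o3"]


if __name__ == "__main__":
    for name in SCORES:
        print(f"{name}: o4-mini reaches {relative_score(name):.0%} of o3's score")
```

On these figures the ratio runs from about 80% on ARC-AGI to about 96% on AIME 2024 and SWE-bench, so the gap is widest on novel reasoning and narrowest on math and software engineering.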

The o4-mini Case: When Good Enough Is the Right Call

o4-mini was the underrated release. At $1.10/M input tokens, it delivers reasoning-model-grade performance on most practical tasks (coding, math, structured analysis, multi-step workflows) without the cost structure that makes o3 impractical at scale.

For most AI application builders in 2026, o4-mini is the right default for reasoning-heavy features. The cost savings are real: at these input prices, a product processing 100B input tokens per month pays about $110K on o4-mini vs $1M on o3 (at 100M tokens per month, that is $110 vs $1,000). On the SWE-bench software engineering benchmark, the delta is 3 percentage points (68% vs 71%). Most companies can't justify paying 9x for a 3-point improvement in task completion.
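
Input spend scales linearly with volume, so the comparison is simple to recompute for any workload. A minimal sketch, assuming only the input prices quoted in this article and ignoring output and reasoning tokens; the function and constant names are made up for illustration.

```python
# Input prices in USD per million tokens, as quoted in this article.
INPUT_PRICE_PER_M = {"o3": 10.00, "o4-mini": 1.10}


def monthly_input_cost(model: str, tokens_per_month: int) -> float:
    """Input-token spend in USD for one month (output and reasoning tokens excluded)."""
    return tokens_per_month / 1_000_000 * INPUT_PRICE_PER_M[model]


if __name__ == "__main__":
    for volume in (100_000_000, 100_000_000_000):  # 100M and 100B tokens
        o3 = monthly_input_cost("o3", volume)
        mini = monthly_input_cost("o4-mini", volume)
        print(f"{volume:,} tokens/month: o3 ${o3:,.0f}, o4-mini ${mini:,.0f}, ratio {o3 / mini:.1f}x")
```

The ratio stays at roughly 9x at any volume; only the absolute dollar gap changes, which is why the decision matters more as traffic grows.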

o3: Max capability tasks
PhD-level research, expert legal analysis, novel scientific reasoning, one-shot complex code architecture

o4-mini: Reasoning at scale
Coding assistants, financial modeling, multi-step automation, structured data extraction

GPT-4o: Speed + volume
Real-time chat, content generation, classification, high-throughput query processing

The Hidden Cost: Reasoning Tokens

The headline API price understates the real cost. Reasoning models generate thinking tokens (internal scratchpad content) that are billed at the output token rate. On complex tasks, reasoning tokens can represent 20–50% of total token usage, and they're invisible to you in the response.

A query that produces 500 visible output tokens might consume 1,500 reasoning tokens behind the scenes. At o3's $40/M output rate, that's an extra $0.06 per query, or $60 per thousand queries, that the headline input price doesn't show. Teams building on o3 at scale need to instrument reasoning token consumption carefully; it's the number that actually blows up your inference budget.
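
One way to make this cost visible is to split each request's bill into input, visible output, and reasoning components. The sketch below is a minimal illustration, not an SDK integration: the `Usage` record is a hypothetical container for token counts, the 1,000-token input size is an assumption, and the rates are the o3 prices quoted in this article.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Token counts for one request (hypothetical record, not an SDK type)."""
    input_tokens: int
    visible_output_tokens: int
    reasoning_tokens: int


def query_cost(usage: Usage, input_rate: float, output_rate: float) -> dict:
    """Split one query's cost in USD; rates are USD per million tokens.

    Reasoning tokens are charged at the output rate, as described above.
    """
    per_token_in = input_rate / 1_000_000
    per_token_out = output_rate / 1_000_000
    input_cost = usage.input_tokens * per_token_in
    visible_cost = usage.visible_output_tokens * per_token_out
    reasoning_cost = usage.reasoning_tokens * per_token_out
    return {
        "input": input_cost,
        "visible_output": visible_cost,
        "reasoning": reasoning_cost,
        "total": input_cost + visible_cost + reasoning_cost,
    }


if __name__ == "__main__":
    # 500 visible and 1,500 reasoning tokens come from the example above;
    # the 1,000 input tokens are an assumption for illustration.
    usage = Usage(input_tokens=1_000, visible_output_tokens=500, reasoning_tokens=1_500)
    cost = query_cost(usage, input_rate=10.0, output_rate=40.0)
    print(f"reasoning cost per query: ${cost['reasoning']:.2f}")
    print(f"reasoning share of total cost: {cost['reasoning'] / cost['total']:.0%}")
```

In a real integration the three counts would come from the usage data returned with each API response. Logging the reasoning share per feature, rather than only total spend, shows which workloads are driving the bill.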

o4-mini mitigates this significantly. Lower output token prices ($4.40/M) mean even 50% reasoning overhead keeps the effective cost manageable. This is one of the most underappreciated reasons o4-mini is the better production choice for most applications.

What This Means for AI Startups and Application Builders

Three things shift when reasoning models become the baseline:

The floor for "capable enough" just rose dramatically

Problems that required multi-agent orchestration or custom fine-tuning 18 months ago are now solvable with a single o4-mini API call. This compresses the competitive moat of AI infrastructure startups and raises the bar for what "good" looks like to enterprise buyers.

Prompt engineering matters less; task specification matters more

Reasoning models are better at inferring intent from underspecified prompts. The differentiating skill is now defining the right task boundary (what the model should and shouldn't try to reason through), not crafting elaborate prompts to compensate for model limitations.

Latency becomes the new constraint

o3 thinking on a hard problem can take 30–120 seconds. That's fine for asynchronous workflows; it's unusable for real-time UX. The architecture split between fast-path (GPT-4o) and deep-reasoning (o3/o4-mini) is becoming a standard design pattern in production AI systems.
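
The fast-path versus deep-reasoning split can be sketched as a small routing function. Everything here is an illustrative assumption rather than a recommended policy: the `Task` fields, the 5-second threshold, and the rule ordering are hypothetical, and the returned strings are just labels for the three tiers discussed in this article.

```python
from dataclasses import dataclass


@dataclass
class Task:
    prompt: str
    needs_multistep_reasoning: bool  # set by the caller or an upstream classifier
    latency_budget_s: float          # how long the user or job can wait
    max_capability: bool = False     # reserve for one-shot, high-stakes analysis


# Hypothetical threshold: below this budget, a long thinking phase is not acceptable.
REALTIME_BUDGET_S = 5.0


def choose_model(task: Task) -> str:
    """Pick a model tier for a task using simple, illustrative rules."""
    if not task.needs_multistep_reasoning:
        return "gpt-4o"      # fast path: chat, classification, content generation
    if task.latency_budget_s < REALTIME_BUDGET_S:
        return "gpt-4o"      # reasoning would help, but the wait would break the UX
    if task.max_capability:
        return "o3"          # asynchronous work where the extra cost is justified
    return "o4-mini"         # default for reasoning-heavy work


if __name__ == "__main__":
    print(choose_model(Task("Summarize this ticket", False, 2.0)))
    print(choose_model(Task("Refactor this module", True, 60.0)))
    print(choose_model(Task("Review this merger agreement", True, 600.0, max_capability=True)))
```

Keeping the routing rule in one function makes it easy to adjust thresholds as latency and pricing change, without touching the call sites.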

The Investor Angle: What Reasoning Models Do to AI Valuations

I track AI company valuations at the AI Valuations dashboard and the trend is clear: reasoning model capability is being priced as a category separator, not a feature. OpenAI crossed $10B ARR in early 2026, the fastest SaaS company to that milestone, and the o-series models are the primary driver of enterprise upsell.

For startups, the implication is straightforward: if your product's value proposition is "we use AI to do X," and X is now achievable with an o4-mini API call, your moat is thinner than it was 12 months ago. The startups holding their ground are the ones with proprietary data, workflow lock-in, and vertical depth that can't be replicated by upgrading a model.

Conversely, if you're building in a domain where reasoning model capability unlocks a genuinely new workflow, one that required humans before, you're in the most interesting part of the market right now. Legal, scientific research, complex financial modeling, and engineering design are all seeing that shift in real time.

The model is no longer the product differentiator.

The data, workflow integration, and domain depth are. o3 and o4-mini just raised the floor; the ceiling is still yours to build.

Track AI company valuations and frontier model pricing at the AI Valuations Dashboard and AI Spending Dashboard at Value Add VC. Originally published in the Trace Cohen newsletter.

Frequently Asked Questions

What is the OpenAI o3 reasoning model?

OpenAI o3 is a frontier reasoning model released in April 2025 that uses extended internal chain-of-thought to solve complex problems. It scores 87.5% on the ARC-AGI benchmark and ~88% on GPQA Diamond (PhD-level science questions), outperforming o1 on nearly every benchmark. It's designed for tasks requiring multi-step reasoning, not just pattern matching.

What is the difference between o3 and o4-mini?

o3 is OpenAI's most capable reasoning model: highest benchmark scores, longest thinking budget, best for complex research, coding, and analysis. o4-mini is a smaller, faster, cheaper version targeting ~80–90% of o3's performance at roughly 9x lower input token cost ($1.10/M vs $10/M). Most production applications should default to o4-mini unless they need o3's ceiling.

How much does OpenAI o3 cost per API call?

As of mid-2026, o3 is priced at approximately $10/M input tokens and $40/M output tokens. o4-mini is $1.10/M input and $4.40/M output. Reasoning tokens (the internal thinking steps) are billed at the output token rate and can represent 20–50% of total token usage on complex queries, so effective cost per task can be significantly higher than the headline rates suggest.

What use cases are reasoning models best for?

Reasoning models like o3 and o4-mini excel at multi-step code generation and debugging, mathematical reasoning, scientific analysis, legal and financial document review, and any task where the answer requires breaking down subproblems. They underperform on pure speed tasks, simple factual retrieval, and high-volume low-complexity queries where GPT-4o is cheaper and faster.

Should I build my AI application on o3, o4-mini, or GPT-4o?

For most production AI applications in 2026, o4-mini is the right default for reasoning-heavy workflows: strong capability, manageable cost. GPT-4o remains better for high-throughput, lower-complexity tasks where latency and cost matter more than deep reasoning. Use o3 only when you need maximum capability and can justify the cost (e.g., one-shot complex analysis, not real-time user queries).
