What is the difference between o3 and o4-mini?

o3 is the larger, more capable flagship reasoning model, while o4-mini is a smaller, faster, and cheaper model optimized for high-volume math and coding tasks. o4-mini actually matches or beats o3 on some math benchmarks like AIME while costing about a quarter as much — roughly $1.10 per million input tokens versus o3's $2. For most production workloads, o4-mini is the better default and you escalate to o3 only when needed.

How much does the OpenAI o3 reasoning model cost?

After an 80% price cut in June 2025, OpenAI o3 costs about $2 per million input tokens and $8 per million output tokens through the API. o4-mini is cheaper at roughly $1.10 input and $4.40 output per million tokens. Because reasoning models generate hidden thinking tokens that you pay for as output, real-world cost per query is higher than the headline rate suggests.

Can OpenAI o3 use tools during reasoning?

Yes. o3 and o4-mini are the first OpenAI models that can agentically call tools — web search, Python code execution, file analysis, and image generation — inside a single chain of thought. The model decides when to search the web or run code mid-reasoning, then incorporates the result. This is what makes them effective for agentic AI applications rather than just single-turn answers.

When should you use a reasoning model instead of GPT-4o?

Use a reasoning model like o3 or o4-mini for tasks requiring multi-step logic — complex coding, math, scientific analysis, or agentic workflows — where accuracy matters more than speed. For simple chat, summarization, or latency-sensitive tasks, a standard model like GPT-4o is faster and cheaper. The rule of thumb: reasoning models are worth their 5-30 second latency only when a wrong answer is expensive.

OpenAI o3: 69.1% SWE-bench, 2,700+ Codeforces Elo

Q: What is the OpenAI o3 reasoning model?

OpenAI o3 is a reasoning model released on April 16, 2025 that generates an internal chain of thought before answering, trading latency for accuracy on hard problems. It scores 69.1% on SWE-bench Verified and over 2,700 Elo on Codeforces, placing it among the strongest coding and math models available. Unlike standard chat models, o3 can call tools like web search and Python during its reasoning.

OpenAI's o3 is a reasoning model that thinks before it answers — and on the hardest coding benchmark, SWE-bench Verified, it solves 69.1% of real GitHub issues, up from o1's 48.9%. Its smaller sibling o4-mini hits nearly the same scores for about a quarter of the cost. That's the short answer. The longer answer is more interesting.

Released April 16, 2025, o3 and o4-mini were the moment OpenAI's reasoning line stopped being a research demo and became the default choice for serious applications. The reason is not just higher benchmark numbers. It is that these were the first OpenAI models that could call tools — search the web, run Python, read an image — in the middle of their own chain of thought. That changes what you can build.

What Is the OpenAI o3 Reasoning Model?

The OpenAI o3 reasoning model is a large language model that generates an internal chain of thought before producing an answer, trading a few seconds of latency for substantially higher accuracy on hard problems. Released April 16, 2025, it scores 69.1% on SWE-bench Verified and over 2,700 Elo on Codeforces, and it can invoke tools like web search and Python during reasoning rather than only at the end.

The distinction from a standard model like GPT-4o matters. A chat model maps your prompt directly to an answer in one pass. A reasoning model spends "thinking tokens" first — exploring approaches, checking its own work, backtracking — and only then commits. For a factual lookup that is wasted effort. For a multi-step math proof or a gnarly debugging task, it is the difference between a confident wrong answer and a correct one.

o3 succeeded o1 and o3-mini, and shipped alongside o4-mini, a smaller and cheaper reasoning model. Both carry a 200K-token context window and can return up to 100K output tokens — enough to reason through and rewrite an entire codebase file in a single call.

OpenAI o3 vs o4-mini: Reasoning Model Benchmarks Compared

The surprising part of the April 2025 release was how close o4-mini runs to o3 despite costing a fraction as much. On pure math, o4-mini actually edges ahead. o3 keeps its lead on the broadest, hardest tasks — agentic coding, scientific reasoning, and visual problems — but for high-volume production work, the cheaper model is often the smarter default.

Benchmark	o3	o4-mini	o1 (prior gen)
SWE-bench Verified (coding)	69.1%	68.1%	48.9%
Codeforces (Elo)	~2,706	~2,719	~1,891
AIME 2025 (math, w/ tools)	~98.4%	~99.5%	~83%
GPQA Diamond (science)	~83.3%	~81.4%	~78%
MMMU (visual reasoning)	~82.9%	~81.6%	~77.6%
Context window	200K tokens	200K tokens	200K tokens

Figures are from OpenAI's April 2025 o3 and o4-mini launch benchmarks and prior o1 system-card results. Math and coding scores use the models' tool-enabled high-effort settings; real-world performance varies with reasoning-effort configuration and prompt design.

o3 Reasoning Model Pricing: What It Actually Costs to Ship

Pricing is where reasoning models went from experimental to deployable. In June 2025, OpenAI cut o3 API pricing by 80%, bringing it to roughly $2 per million input tokens and $8 per million output tokens. o4-mini sits below that at about $1.10 input and $4.40 output. The catch: reasoning models generate hidden thinking tokens you pay for as output, so a single "hard" query can burn 5,000–20,000 output tokens before you see a one-paragraph answer.

Model	Input / 1M tokens	Output / 1M tokens	Best For
o3	~$2.00	~$8.00	Hardest agentic + coding tasks
o3-pro	~$20.00	~$80.00	Max accuracy, latency-tolerant
o4-mini	~$1.10	~$4.40	High-volume math + coding default
GPT-4o	~$2.50	~$10.00	Chat, latency-sensitive tasks
o1 (prior)	~$15.00	~$60.00	Legacy — superseded by o3
GPT-4o mini	~$0.15	~$0.60	Cheapest non-reasoning fallback

Pricing reflects OpenAI's published API rates following the June 2025 o3 price reduction; figures are approximate and exclude cached-input discounts. Reasoning models bill hidden chain-of-thought tokens as output, so effective per-query cost exceeds the headline output rate.

How to Use the o3 Reasoning Model in AI Applications

The headline capability is tool use inside reasoning. o3 and o4-mini can decide, mid-thought, to run a web search, execute Python, analyze an uploaded file, or even generate an image — then fold the result back into the chain. OpenAI calls this "thinking with images" when applied to vision. For builders, it means a single API call can do what previously required a hand-wired agent loop.

01
Tune the reasoning_effort parameter
Low, medium, and high settings trade cost and latency for accuracy. Most production traffic should default to low or medium; reserve high effort for the queries that genuinely need it. The difference can be 3-5x in both token spend and response time.
02
Route, don't default
Use a cheap classifier or GPT-4o mini to decide whether a query needs reasoning at all. Sending every request to o3 is the most common way teams blow their budget — most prompts do not need a chain of thought.
03
Default to o4-mini, escalate to o3
For coding, math, and structured analysis, o4-mini delivers ~95% of o3's quality at ~25% of the cost. Run it as your baseline and fall through to o3 only when o4-mini fails a validation check or low-confidence signal.
04
Budget for hidden tokens
Reasoning tokens are billed but not always shown. Set max_completion_tokens caps and monitor actual output token counts in production — a complex query can quietly cost 10-20x a simple one even at the same model.

What the Reasoning Models Mean for AI Startups and Investors

From where I sit — having backed 65+ companies, many of them AI-native — the o3 generation reset two things. First, it raised the floor on what a thin application layer can do without custom agent infrastructure; a lot of last year's "agent orchestration" startups now compete with a single model call. Second, the 80% price cut made reasoning economically viable at consumer scale, which is why you saw a wave of products switch their default model in mid-2025.

The investing implication is blunt: if your AI startup's moat was "we wrap reasoning behind a nice UI," that moat is now thinner. The durable companies are the ones using these models to do something hard in a specific vertical — where proprietary data, workflow integration, and distribution matter more than the raw model. If you're tracking how AI model capability maps to startup valuations, our AI valuations dashboard follows where the private capital is actually going.

Reasoning models are no longer the expensive option.

At ~$2 per million input tokens and 69.1% on SWE-bench, o3 turned chain-of-thought from a demo into a default.

Route intelligently, default to o4-mini, and reserve o3 for the queries where a wrong answer is expensive.

Track how AI model capability is repricing private markets on the Value Add VC AI valuations dashboard. Originally published in the Trace Cohen newsletter.

Get VC data most people never see — free.

Weekly benchmarks, valuations, and fund data. No spam, unsubscribe anytime.

What Is the OpenAI o3 Reasoning Model?

OpenAI o3 vs o4-mini: Reasoning Model Benchmarks Compared

Benchmark	o3	o4-mini	o1 (prior gen)
SWE-bench Verified (coding)	69.1%	68.1%	48.9%
Codeforces (Elo)	~2,706	~2,719	~1,891
AIME 2025 (math, w/ tools)	~98.4%	~99.5%	~83%
GPQA Diamond (science)	~83.3%	~81.4%	~78%
MMMU (visual reasoning)	~82.9%	~81.6%	~77.6%
Context window	200K tokens	200K tokens	200K tokens

o3 Reasoning Model Pricing: What It Actually Costs to Ship

Model	Input / 1M tokens	Output / 1M tokens	Best For
o3	~$2.00	~$8.00	Hardest agentic + coding tasks
o3-pro	~$20.00	~$80.00	Max accuracy, latency-tolerant
o4-mini	~$1.10	~$4.40	High-volume math + coding default
GPT-4o	~$2.50	~$10.00	Chat, latency-sensitive tasks
o1 (prior)	~$15.00	~$60.00	Legacy — superseded by o3
GPT-4o mini	~$0.15	~$0.60	Cheapest non-reasoning fallback

How to Use the o3 Reasoning Model in AI Applications

01
Tune the reasoning_effort parameter
Low, medium, and high settings trade cost and latency for accuracy. Most production traffic should default to low or medium; reserve high effort for the queries that genuinely need it. The difference can be 3-5x in both token spend and response time.
02
Route, don't default
Use a cheap classifier or GPT-4o mini to decide whether a query needs reasoning at all. Sending every request to o3 is the most common way teams blow their budget — most prompts do not need a chain of thought.
03
Default to o4-mini, escalate to o3
For coding, math, and structured analysis, o4-mini delivers ~95% of o3's quality at ~25% of the cost. Run it as your baseline and fall through to o3 only when o4-mini fails a validation check or low-confidence signal.
04
Budget for hidden tokens
Reasoning tokens are billed but not always shown. Set max_completion_tokens caps and monitor actual output token counts in production — a complex query can quietly cost 10-20x a simple one even at the same model.

What the Reasoning Models Mean for AI Startups and Investors

Reasoning models are no longer the expensive option.

At ~$2 per million input tokens and 69.1% on SWE-bench, o3 turned chain-of-thought from a demo into a default.

Route intelligently, default to o4-mini, and reserve o3 for the queries where a wrong answer is expensive.

Track how AI model capability is repricing private markets on the Value Add VC AI valuations dashboard. Originally published in the Trace Cohen newsletter.

Get VC data most people never see — free.

Weekly benchmarks, valuations, and fund data. No spam, unsubscribe anytime.

OpenAI o3 and o4-mini: What the New Reasoning Models Mean for AI Applications

What Is the OpenAI o3 Reasoning Model?

OpenAI o3 vs o4-mini: Reasoning Model Benchmarks Compared

o3 Reasoning Model Pricing: What It Actually Costs to Ship

How to Use the o3 Reasoning Model in AI Applications

What the Reasoning Models Mean for AI Startups and Investors

Frequently Asked Questions

Keep Reading

OpenAI o3 and o4-mini: What the New Reasoning Models Mean for AI Applications

What Is the OpenAI o3 Reasoning Model?

OpenAI o3 vs o4-mini: Reasoning Model Benchmarks Compared

o3 Reasoning Model Pricing: What It Actually Costs to Ship

How to Use the o3 Reasoning Model in AI Applications

What the Reasoning Models Mean for AI Startups and Investors

Frequently Asked Questions

Keep Reading