The cost to run a large language model at GPT-4-class quality has fallen about 95% in two years — and roughly 1,000x in three. That's the short answer. The longer answer is more interesting.
When GPT-4 launched in March 2023, it cost $30 per million input tokens and $60 per million output tokens. By mid-2026, you can get equal-or-better quality from open-weight models for under $0.50 per million tokens — and small efficient models like GPT-4o mini undercut even that. This is the single most important cost curve in software, and almost nobody's portfolio model has caught up to it.
AI Inference Cost Reduction in 2026: The Short Version
AI inference cost reduction in 2026 means roughly a 95% decline over two years and close to 1,000x over three for a fixed capability level. The price to produce GPT-4-quality output fell from about $30 per million input tokens in 2023 to under $0.50 in 2026, driven by algorithmic efficiency, cheaper hardware, and open-weight competition. The frontier still costs money — but yesterday's frontier is nearly free.
a16z named this phenomenon "LLMflation" — the inverse of inflation, where the same unit of intelligence gets dramatically cheaper every year. Their late-2024 analysis pegged the decline at roughly 10x per year for a fixed quality bar. Eighteen months later, that pace has held. The implication is brutal for anyone who priced their AI product around 2023 token costs and comforting for anyone building margin-sensitive applications today.
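To see how the headline figures relate, you can convert a percentage decline into a fold reduction and then annualize it. Here's a minimal Python sketch, assuming the price falls at a constant rate each year. The function names are illustrative and not taken from the a16z analysis.

```python
def fold_from_pct_decline(pct_decline: float) -> float:
    """Convert a percentage price decline into a fold reduction (95% -> 20x)."""
    return 1.0 / (1.0 - pct_decline / 100.0)


def annual_factor(total_fold: float, years: float) -> float:
    """Constant per-year fold reduction that compounds to total_fold over years."""
    return total_fold ** (1.0 / years)


two_year_fold = fold_from_pct_decline(95.0)
print(round(two_year_fold, 1))                    # 20.0
print(round(annual_factor(two_year_fold, 2), 2))  # 4.47 (pace implied by 95% in two years)
print(round(annual_factor(1000.0, 3), 2))         # 10.0 (pace implied by 1,000x in three years)
```

Under that constant-rate assumption, 95% in two years works out to roughly 4.5x per year, while 1,000x in three years is the full 10x per year.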
The Price History: What Inference Actually Cost, Year by Year
The cleanest way to see the AI inference cost reduction is to track flagship API pricing against the open-model floor. The table below shows input and output list prices, in dollars per million tokens, for representative models at launch.
| Model (launch) | Input $/M | Output $/M | Tier |
|---|---|---|---|
| GPT-4 (Mar 2023) | $30.00 | $60.00 | Frontier flagship |
| GPT-4 Turbo (Nov 2023) | $10.00 | $30.00 | Frontier flagship |
| GPT-4o (May 2024) | $5.00 | $15.00 | Frontier flagship |
| GPT-4o mini (Jul 2024) | $0.15 | $0.60 | Small / efficient |
| Gemini 1.5 Flash (2024) | $0.075 | $0.30 | Small / efficient |
| Llama 3.3 70B (hosted, 2025) | $0.20 | $0.40 | Open weight |
| DeepSeek V3 (2025) | $0.27 | $1.10 | Open weight |
| GPT-4-class floor (2026) | ~$0.10 | ~$0.40 | Open weight |
Figures are launch-list API prices drawn from OpenAI, Google, and Anthropic published pricing, plus hosted open-model rates from Together AI, Fireworks, and DeepSeek (2023–2026). The 2026 "floor" is an approximation that reflects commodity hosting of quantized 70B-class open weights matching the original GPT-4 on MMLU.
Read the first and last rows together: $30/$60 in 2023 versus roughly $0.10/$0.40 in 2026. That's a drop of about 99.7% in input cost (roughly 300x) and about 99.3% in output cost (roughly 150x) for the same 2023-frontier quality bar, comfortably past the "95% in two years" headline once you stretch to three years.
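The comparison between the first and last rows can be reproduced in a few lines. This sketch uses only the two rows from the table; the `decline` helper is a hypothetical name, and the 2026 floor prices are the table's approximate figures.

```python
# Launch-list prices in $ per million tokens, copied from the table above.
prices = {
    "GPT-4 (Mar 2023)": {"input": 30.00, "output": 60.00},
    "GPT-4-class floor (2026)": {"input": 0.10, "output": 0.40},  # approximate
}


def decline(old: float, new: float) -> tuple[float, float]:
    """Return (percent decline, fold reduction) going from old price to new price."""
    return (1.0 - new / old) * 100.0, old / new


start = prices["GPT-4 (Mar 2023)"]
end = prices["GPT-4-class floor (2026)"]
for side in ("input", "output"):
    pct, fold = decline(start[side], end[side])
    print(f"{side}: {pct:.1f}% cheaper, about {fold:.0f}x")
# input: 99.7% cheaper, about 300x
# output: 99.3% cheaper, about 150x
```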
What's Driving the AI Inference Cost Reduction
Three forces compound on top of each other. None alone explains a 1,000x decline — together they do.
Algorithmic efficiency
Smaller models now match older large ones. Quantization (FP8/INT4), speculative decoding, and mixture-of-experts routing cut compute per token 5–20x with little to no quality loss.
Cheaper, faster hardware
Nvidia H100 → H200 → Blackwell B200 raised inference throughput per dollar an estimated 25x+ over three years. Custom silicon (Trainium2, TPU v6, Inferentia) adds price pressure.
Open-weight competition
Llama, DeepSeek, Qwen, and Mistral set a near-zero-margin price floor. Closed labs must price the newest frontier against free-to-self-host alternatives one tier behind.
The hardware curve alone is staggering. Nvidia claims Blackwell delivers up to 25x lower cost and energy per inference token versus the Hopper generation for large MoE models. Stack quantization and better serving software on top, and a query that cost $0.10 of compute in 2023 can cost well under a cent in 2026 — before you even switch to a smaller model.
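To make the compounding concrete, here's a rough sketch that multiplies the ranges quoted above. Treating the forces as independent multipliers is a simplifying assumption, and Nvidia's 25x figure is a vendor "up to" claim, so read the result as an upper-end illustration rather than a measurement.

```python
# Ranges quoted in the text; multiplying them assumes the gains are independent.
algorithmic_gain = (5.0, 20.0)  # compute per token cut 5-20x
hardware_gain = 25.0            # Nvidia's "up to 25x" claim, an upper bound

low = algorithmic_gain[0] * hardware_gain
high = algorithmic_gain[1] * hardware_gain
print(f"combined: {low:.0f}x to {high:.0f}x")  # combined: 125x to 500x

# Residual factor that price competition and smaller models would still
# need to supply to reach a 1,000x total decline.
target = 1000.0
print(f"residual: {target / high:.0f}x to {target / low:.0f}x")  # residual: 2x to 8x
```

On these numbers, algorithms and hardware together account for a few hundred-fold at most, which is why the third force matters.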
The Jevons Paradox: Why Cheaper Tokens Mean Bigger Bills
Here's the counterintuitive part that trips up founders and LPs alike: a 95% AI inference cost reduction has not lowered most companies' AI bills. It has raised them. This is the Jevons paradox — when a resource gets cheaper, you use far more of it.
Reasoning models are the clearest example. An o3- or GPT-5-class model can burn 10x to 50x more tokens per query than a 2023 chat completion because it "thinks" through long chains before answering. So even as the price per token falls 95%, a single complex query can cost more than it did in 2023. Add agentic workflows that fire dozens of model calls per task, and aggregate token consumption is growing far faster than unit price is falling.
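The per-query arithmetic is easy to check. In this sketch the 1,000-token baseline is a hypothetical chosen for illustration; the 95% price drop and the 10x to 50x token multipliers come from the paragraph above.

```python
def query_cost(tokens: int, price_per_million: float) -> float:
    """Dollar cost of one query at a flat per-token price."""
    return tokens / 1_000_000 * price_per_million


# Hypothetical baseline: a 1,000-token chat completion at $30 per million tokens.
baseline_tokens = 1_000
baseline_price = 30.00
price_2026 = baseline_price * 0.05  # 95% cheaper per token

chat_2023 = query_cost(baseline_tokens, baseline_price)
for multiplier in (10, 50):  # reasoning models use 10x to 50x more tokens
    cost = query_cost(baseline_tokens * multiplier, price_2026)
    print(f"{multiplier}x tokens: {cost / chat_2023:.1f}x the 2023 cost")
# 10x tokens: 0.5x the 2023 cost
# 50x tokens: 2.5x the 2023 cost
```

A 95% price cut is a 20x reduction, so the break-even point is 20x more tokens per query. Anything beyond that costs more than the 2023 baseline.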
The practical takeaway: cheaper inference is unlocking entirely new product categories — coding agents, deep research, document-scale analysis — that were economically impossible at 2023 prices. The cost curve isn't deflating the AI market; it's expanding the addressable surface. You can see the spending side of this on the AI Spending dashboard and how it flows into private valuations on AI Valuations.
What the AI Inference Cost Reduction Means for Founders and Investors
What gets easier
- ✓ Margin-sensitive apps that were uneconomic at $30/M tokens now clear 70%+ gross margins
- ✓ Free and freemium AI tiers become sustainable as the model floor approaches zero
- ✓ Self-hosting open weights gives cost control and removes per-token vendor lock-in
- ✓ New categories — agents, deep research, real-time voice — become viable
What gets harder
- ✕ "We resell tokens at a markup" is not a business — the markup compresses to zero
- ✕ Pricing and financial models still anchored to 2023 token costs are leaving 90%+ of gross margin on the table
- ✕ Frontier-lab unit economics stay brutal as they fund the next training run
- ✕ Token-based pricing to customers erodes as buyers expect annual price cuts
For investors, the durable question is no longer "how cheap is inference?" — it's heading to zero. The question is what you build on top of near-free intelligence: proprietary data, workflow ownership, and distribution. The model is a commodity; the moat is everything around it.
What Happens Next: 2027 and Beyond
Expect the fixed-capability cost curve to keep dropping roughly 4x to 10x per year through 2027, driven by the same three forces — though the easy quantization gains are mostly spent, so the rate may soften toward 4x. Three things to watch:
Reasoning-token deflation
The next efficiency frontier is making models think more cheaply, not just talk cheaply. Cheaper reasoning tokens will matter more than cheaper chat tokens by 2027.
Inference-optimized silicon
Blackwell, Trainium2, and inference-specific startups (Groq, Cerebras, Etched) target order-of-magnitude throughput gains for serving, not training.
The frontier premium persists
The newest, best model will always cost real money. The gap between frontier and floor is the price of being 12–18 months ahead.
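For a sense of what the 4x to 10x range implies, here's a small projection from the approximate 2026 floor in the pricing table. It assumes a constant annual rate, and `project` is a hypothetical helper, so treat the output as a scenario rather than a forecast.

```python
# Approximate 2026 floor from the pricing table ($ per million tokens).
floor_2026 = {"input": 0.10, "output": 0.40}


def project(price: float, annual_factor: float, years: int = 1) -> float:
    """Price after `years` of a constant annual_factor-fold decline."""
    return price / annual_factor ** years


for factor in (4, 10):
    inp = project(floor_2026["input"], factor)
    out = project(floor_2026["output"], factor)
    print(f"{factor}x/year -> 2027 floor: ${inp:.3f} in, ${out:.3f} out")
# 4x/year -> 2027 floor: $0.025 in, $0.100 out
# 10x/year -> 2027 floor: $0.010 in, $0.040 out
```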
Inference is deflating ~10x a year. Your moat can't be the model.
Build on proprietary data, workflow ownership, and distribution — the things that don't get 95% cheaper.
Track AI spending and private valuations on the AI Spending and AI Valuations dashboards at Value Add VC. Originally published in the Trace Cohen newsletter.