The AI industry spent 2022–2024 obsessing over training compute. The next $500B constraint is inference — and most people are still not paying attention.
Training a frontier model like GPT-4 costs an estimated $50M–$100M in compute. That is a large number that makes headlines. But OpenAI alone was reportedly spending over $700,000 per day on inference in mid-2023 — before ChatGPT scaled to 100 million weekly actives. Run those numbers forward two years and the inference bill dwarfs training by an order of magnitude. The bottleneck shifted and most investors have not updated their mental model.
Training vs. Inference: The Numbers That Matter
The confusion between training and inference costs is one of the most persistent mistakes in AI analysis. Training is a capital expense — you pay once to build the model. Inference is an operating expense that scales with every query. At consumer scale, inference wins the cost race by a landslide.
- $50M–$100M: GPT-4 estimated training cost (one-time compute expense)
- $700K+/day: OpenAI inference spend, 2023 estimate (scales with every query)
- $250M+/yr: annual inference bill at that run rate (before Sora and the o-series reasoning models)
- ~70–80%: inference share of total AI compute (as models reach production scale)
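These figures fit on a napkin. A quick sanity check of the headline numbers, using the public estimates cited above rather than any audited cost data:

```python
# Back-of-the-envelope check of the headline numbers above.
# Inputs are the public estimates cited in the text, not audited figures.
TRAINING_COST = 100e6          # upper-bound GPT-4 training estimate, USD
DAILY_INFERENCE = 700e3        # reported 2023 inference spend, USD/day

def days_until_inference_exceeds_training(training_cost: float,
                                          daily_inference: float) -> float:
    """Days of serving traffic before cumulative inference passes training."""
    return training_cost / daily_inference

days = days_until_inference_exceeds_training(TRAINING_COST, DAILY_INFERENCE)
annual = DAILY_INFERENCE * 365
print(f"inference overtakes training after ~{days:.0f} days")  # ~143 days
print(f"annual inference bill: ${annual / 1e6:.0f}M")          # ~$256M
```

Even at the 2023 run rate, the one-time training bill is recouped by the serving bill in under five months; everything after that is pure opex.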
Why Inference Is Structurally Harder Than Training
Training is a batched, predictable workload. You spin up 10,000 GPUs, run a job for weeks, and you are done. Inference is latency-sensitive, bursty, and economically brutal at scale. Every millisecond of latency is a product problem. Every wasted GPU cycle is margin destruction.
- Latency requirements are brutal. Consumer AI products need sub-200ms responses. Reasoning models like o3 that chain multiple inference passes burn 10–50x more compute per query than simple completions, at exactly the moment users expect the fastest experience.
- Load is unpredictable. Training runs on a fixed schedule. Inference spikes when a tweet goes viral, when a product launches, when a news event breaks. Overprovisioning costs money. Underprovisioning destroys retention. Neither is acceptable at scale.
- Model size is the wrong lever. The naive solution, smaller models, trades capability for cost. The smarter solution is model distillation, speculative decoding, quantization, and caching. These are engineering investments most startups are not making.
- Specialized chips are winning. NVIDIA GPUs were designed for training. Groq's LPUs, Google's TPUs, and AWS Inferentia chips are purpose-built for inference throughput, and they are showing 3–10x cost advantages in production workloads.
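The compute multiplier in the first point is easy to model. A toy sketch, where the pass counts, token counts, and blended price are all invented for illustration, not drawn from any vendor's published rates:

```python
# Toy cost model: one simple completion vs. a reasoning chain that runs
# several internal inference passes before answering. All numbers below
# are hypothetical, chosen only to show the shape of the math.
PRICE_PER_TOKEN = 0.00001      # assumed blended inference price, USD

def query_cost(passes: int, tokens_per_pass: int) -> float:
    """Cost of one user query that internally runs `passes` model calls."""
    return passes * tokens_per_pass * PRICE_PER_TOKEN

simple = query_cost(passes=1, tokens_per_pass=500)
chain = query_cost(passes=10, tokens_per_pass=2000)   # chained reasoning
print(f"simple completion: ${simple:.4f}")            # $0.0050
print(f"reasoning chain:   ${chain:.4f}")             # $0.2000
print(f"multiplier: {chain / simple:.0f}x")           # 40x, inside 10-50x
```

The point of the sketch: the multiplier is the product of two growth terms, passes and tokens per pass, and reasoning models push both at once.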
The Chip War Has a New Battlefield
NVIDIA built its dominance on training. H100 clusters fill data centers because they are the best available option for large-scale training runs. But inference is a different competitive game — and the incumbents are not guaranteed to win it.
NVIDIA GPUs
- ✓ Training dominance, massive installed base
- ✕ Not optimized for inference latency vs. throughput

Google TPUs
- ✓ Inference-optimized, powers Gemini at scale
- ✕ Only accessible through Google Cloud

AWS Inferentia
- ✓ Deep cloud integration, competitive pricing
- ✕ Still catching up on software ecosystem

Groq LPUs
- ✓ 500+ tokens/sec per chip, lowest latency available
- ✓ Fastest throughput for large models
- ✕ Capital-intensive, limited scale vs. hyperscalers
- ✕ Narrow addressable market, not a broad solution
What This Means for Founders and Investors
I have sat across from dozens of AI founders in the last 18 months who could tell me exactly what it cost to fine-tune their model but had no idea what it would cost to serve 1 million daily active users. That is an existential blind spot.
The unit economics of AI products are determined at inference time, not training time. A startup with a 70% gross margin at $500K ARR can watch that margin evaporate to 15% at $5M ARR if inference costs scale linearly with usage. The companies that survive will be those that treat inference optimization as a core engineering competency from day one.
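That margin-compression path can be sketched with a toy model. The loud assumptions: flat subscription pricing, per-user usage that grows as the product matures, and a constant inference cost per query (all input values invented for illustration):

```python
# Toy unit-economics model showing gross margin compressing as usage grows
# under flat pricing. Both inputs are hypothetical illustrations.
PRICE_PER_USER_MONTH = 20.0    # flat subscription, USD
COST_PER_QUERY = 0.02          # assumed inference cost per query, USD

def gross_margin(queries_per_user_month: float) -> float:
    """Gross margin per user when inference is the only variable cost."""
    inference_cost = queries_per_user_month * COST_PER_QUERY
    return (PRICE_PER_USER_MONTH - inference_cost) / PRICE_PER_USER_MONTH

print(f"light usage, 300 queries/mo: {gross_margin(300):.0%}")  # 70%
print(f"heavy usage, 850 queries/mo: {gross_margin(850):.0%}")  # 15%
```

Nothing about the business changed between those two lines except how hard users lean on the product. Under flat pricing, engagement, the thing every consumer startup optimizes for, is also the thing that eats the margin.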
Inference-Aware Strategies That Work
- ✓ Model distillation — smaller models for common queries
- ✓ Aggressive semantic caching of repeated requests
- ✓ Speculative decoding to reduce latency at low cost
- ✓ Usage-based pricing tied to actual compute consumed
- ✓ Routing queries to the cheapest model that meets quality bar
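The caching and routing items can live in the same thin serving layer. A minimal sketch: the model names, per-query prices, and the quality heuristic are all invented, and a real system would use embedding-based semantic lookup rather than exact-match hashing:

```python
import hashlib

# Hypothetical model tiers, cheapest first; prices are illustrative USD/query.
MODELS = [("small-distilled", 0.0005), ("mid-tier", 0.003), ("frontier", 0.03)]

cache: dict[str, str] = {}     # exact-match stand-in for a semantic cache

def needs_frontier(query: str) -> bool:
    """Invented quality heuristic: long or multi-step queries go upmarket."""
    return len(query) > 200 or "step by step" in query.lower()

def serve(query: str, call_model) -> tuple[str, float]:
    """Return (answer, marginal cost). call_model(name, query) is the backend."""
    key = hashlib.sha256(query.encode()).hexdigest()
    if key in cache:
        return cache[key], 0.0                 # cache hit: zero marginal cost
    name, price = MODELS[-1] if needs_frontier(query) else MODELS[0]
    answer = call_model(name, query)
    cache[key] = answer
    return answer, price

# Wiring it up with a stub backend:
echo = lambda name, q: f"[{name}] answer to: {q}"
print(serve("What is RAG?", echo))             # routed to small-distilled
print(serve("What is RAG?", echo))             # second call hits the cache
```

The length heuristic exists only because the sketch needs something deterministic; production routers typically score queries with a small classifier and escalate to the frontier model when the cheap model's answer fails a quality check.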
Patterns That Blow Up at Scale
- ✕ Sending every query to the largest frontier model
- ✕ Flat subscription pricing with unlimited API calls
- ✕ No caching layer — paying full price for identical queries
- ✕ Assuming inference costs stay constant as models get more capable
- ✕ Ignoring latency until it becomes a retention problem
Training built the models. Inference will determine who profits from them.
The AI companies that master inference economics will generate durable margins. Everyone else is building on a cost structure that gets worse as they grow.
Track AI infrastructure trends and investment themes at Value Add VC. Originally published in the Trace Cohen newsletter.