AI & Technology · May 4, 2026 · 8 min read

Inference Is the New Compute: The Next AI Bottleneck

Everyone obsessed over training costs. The next trillion-dollar constraint in AI is inference — and whoever controls it sets the economics for the entire industry.

Trace Cohen
3x founder, 65+ investments, building Value Add VC

Quick Answer

Inference — running AI models in production to serve real queries — is rapidly eclipsing training as the primary compute bottleneck. As model usage scales from millions to billions of queries per day, inference costs dwarf training costs by 10x or more, reshaping the competitive landscape for chips, cloud providers, and AI startups alike.

The AI industry spent 2022–2024 obsessing over training compute. The next $500B constraint is inference — and most people are still not paying attention.

Training a frontier model like GPT-4 costs an estimated $50M–$100M in compute. That is a large number that makes headlines. But OpenAI alone was reportedly spending over $700,000 per day on inference in mid-2023 — before ChatGPT scaled to 100 million weekly actives. Run those numbers forward two years and the inference bill dwarfs training by an order of magnitude. The bottleneck shifted and most investors have not updated their mental model.

Training vs. Inference: The Numbers That Matter

The confusion between training and inference costs is one of the most persistent mistakes in AI analysis. Training is a capital expense — you pay once to build the model. Inference is an operating expense that scales with every query. At consumer scale, inference wins the cost race by a landslide.

  • $50M–$100M: GPT-4 estimated training cost (a one-time compute expense)

  • $700K+/day: OpenAI inference spend, 2023 estimate (scales with every query)

  • $250M+/yr: annual inference bill at scale (before Sora and the o-series reasoning models)

  • ~70–80%: inference share of total AI compute as models reach production scale
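
To make the shift concrete, here is a back-of-envelope version of the comparison using the rough figures above; the usage-growth multiple is an illustrative assumption, not a reported number.

```python
# Back-of-envelope: one-time training capex vs. recurring inference opex.
# Training and per-day inference figures are the rough public estimates
# cited above; the usage growth multiple is an illustrative assumption.

TRAINING_COST = 100e6            # upper end of the GPT-4 training estimate (one-time)
INFERENCE_PER_DAY_2023 = 700e3   # reported ~$700K/day inference spend in 2023
USAGE_GROWTH = 3                 # assumed usage multiple as the product scales

annual_inference_2023 = INFERENCE_PER_DAY_2023 * 365
annual_inference_scaled = annual_inference_2023 * USAGE_GROWTH

print(f"One-time training cost:     ${TRAINING_COST / 1e6:,.0f}M")
print(f"Inference at 2023 run rate: ${annual_inference_2023 / 1e6:,.0f}M/yr")
print(f"Inference at {USAGE_GROWTH}x usage:       ${annual_inference_scaled / 1e6:,.0f}M/yr")

# At the 2023 run rate alone, the recurring inference bill passes the
# one-time training bill in under five months.
```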

Why Inference Is Structurally Harder Than Training

Training is a batched, predictable workload. You spin up 10,000 GPUs, run a job for weeks, and you are done. Inference is latency-sensitive, bursty, and economically brutal at scale. Every millisecond of latency is a product problem. Every wasted GPU cycle is margin destruction.

  • Latency requirements are brutal

    Consumer AI products need sub-200ms responses. Reasoning models like o3, which chain multiple inference passes per query, burn 10–50x more compute than simple completions, at exactly the moment users expect the fastest experience.

  • Load is unpredictable

    Training runs on a fixed schedule. Inference spikes when a tweet goes viral, when a product launches, when a news event breaks. Overprovisioning costs money. Underprovisioning destroys retention. Neither is acceptable at scale (a rough capacity sketch follows this list).

  • Model size is the wrong lever

    The naive solution — smaller models — trades capability for cost. The smarter solution is model distillation, speculative decoding, quantization, and caching. These are engineering investments most startups are not making.

  • Specialized chips are winning

    NVIDIA GPUs were designed for training. Groq's LPUs, Google's TPUs, and AWS Inferentia chips are purpose-built for inference throughput — and they are showing 3–10x cost advantages in production workloads.
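
A rough capacity sketch of that overprovision-or-underprovision bind; the traffic profile, per-GPU throughput, and GPU pricing here are all illustrative assumptions.

```python
# Rough capacity planning for bursty inference load. All numbers
# (QPS, per-GPU throughput, GPU cost) are illustrative assumptions.

AVG_QPS       = 200      # steady-state queries per second (assumed)
PEAK_QPS      = 1_000    # viral-moment peak (assumed)
QPS_PER_GPU   = 10       # serving throughput per GPU (assumed)
GPU_COST_HOUR = 2.50     # cost per GPU-hour (assumed)

gpus_for_avg  = AVG_QPS // QPS_PER_GPU    # 20 GPUs: cheap, drops traffic on spikes
gpus_for_peak = PEAK_QPS // QPS_PER_GPU   # 100 GPUs: survives the spike, mostly idle

def monthly_cost(gpus: int) -> float:
    return gpus * GPU_COST_HOUR * 24 * 30

print(f"Provision for average: {gpus_for_avg} GPUs, ${monthly_cost(gpus_for_avg):,.0f}/mo")
print(f"Provision for peak:    {gpus_for_peak} GPUs, ${monthly_cost(gpus_for_peak):,.0f}/mo, "
      f"~{AVG_QPS / PEAK_QPS:.0%} utilized most of the time")
```

The spread between those two bills is what the caching, routing, and distillation work later in this piece is trying to claw back.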

The Chip War Has a New Battlefield

NVIDIA built its dominance on training. H100 clusters fill data centers because they are the best available option for large-scale training runs. But inference is a different competitive game — and the incumbents are not guaranteed to win it.

  • NVIDIA, H100 / H200 / B200. Strength: training dominance, massive installed base. Weakness: not optimized for inference latency vs. throughput.

  • Google, TPU v5e / v5p. Strength: inference-optimized, powers Gemini at scale. Weakness: only accessible through Google Cloud.

  • AWS, Trainium2 / Inferentia2. Strength: deep cloud integration, competitive pricing. Weakness: still catching up on software ecosystem.

  • Groq, Language Processing Unit (LPU). Strength: 500+ tokens/sec per chip, lowest latency available. Weakness: capital-intensive, limited scale vs. hyperscalers.

  • Cerebras, WSE-3 (Wafer Scale Engine). Strength: fastest throughput for large models. Weakness: narrow addressable market, not a broad solution.

What This Means for Founders and Investors

I have sat across from dozens of AI founders in the last 18 months who could tell me exactly what it cost to fine-tune their model but had no idea what it would cost to serve 1 million daily active users. That is an existential blind spot.

The unit economics of AI products are determined at inference time, not training time. A startup with a 70% gross margin at $500K ARR can watch that margin evaporate to 15% at $5M ARR if inference spend grows faster than revenue, which is exactly what happens when usage per customer and compute per query both climb under flat pricing. The companies that survive will be those that treat inference optimization as a core engineering competency from day one.
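
The arithmetic behind that collapse is simple. A toy version, with every number assumed for illustration rather than taken from a real company:

```python
# Toy illustration of inference-driven margin compression under flat pricing.
# All figures are assumptions chosen to mirror the scenario above.

def gross_margin(arr: float, inference_cost: float) -> float:
    """Gross margin with inference as the dominant cost of goods sold."""
    return (arr - inference_cost) / arr

# Early: $500K ARR against a $150K/yr inference bill.
print(f"{gross_margin(500_000, 150_000):.0%}")      # 70%

# Later: revenue grows 10x, but queries per user and tokens per query grow
# faster, so the inference bill grows ~28x while pricing stays flat.
print(f"{gross_margin(5_000_000, 4_250_000):.0%}")  # 15%
```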

Inference-Aware Strategies That Work

  • ✓ Model distillation — smaller models for common queries
  • ✓ Aggressive semantic caching of repeated requests
  • ✓ Speculative decoding to reduce latency at low cost
  • ✓ Usage-based pricing tied to actual compute consumed
  • ✓ Routing queries to the cheapest model that meets quality bar (sketched below)
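
A minimal sketch of two of those levers together, caching repeated requests and routing each query to the cheapest model that clears the quality bar; the model names, per-query costs, and quality scores are hypothetical placeholders, not real pricing or benchmarks.

```python
# Minimal sketch: cache repeated requests, route the rest to the cheapest
# model that meets the quality bar. Model names, costs, and quality scores
# are hypothetical placeholders.
from dataclasses import dataclass

@dataclass
class Model:
    name: str
    cost_per_query: float   # assumed relative cost, not real pricing
    quality: float          # assumed benchmark score in [0, 1]

MODELS = [
    Model("small-distilled", 0.001, 0.72),
    Model("mid-tier",        0.010, 0.85),
    Model("frontier",        0.060, 0.95),
]

cache: dict[str, str] = {}  # exact-match stand-in for a semantic cache

def call_model(model: Model, query: str) -> str:
    """Placeholder for the real inference call."""
    return f"[{model.name}] answer to: {query}"

def answer(query: str, quality_bar: float) -> str:
    if query in cache:                      # repeated request: pay nothing
        return cache[query]
    eligible = [m for m in MODELS if m.quality >= quality_bar]
    model = min(eligible, key=lambda m: m.cost_per_query) if eligible else MODELS[-1]
    cache[query] = call_model(model, query)
    return cache[query]

print(answer("summarize this contract", quality_bar=0.80))  # routed to mid-tier
print(answer("summarize this contract", quality_bar=0.80))  # served from cache
```

The quality bar is the real design decision: set it per workload rather than globally, or the router quietly sends everything to the frontier model anyway.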

Patterns That Blow Up at Scale

  • ✕ Sending every query to the largest frontier model
  • ✕ Flat subscription pricing with unlimited API calls
  • ✕ No caching layer — paying full price for identical queries
  • ✕ Assuming inference costs stay constant as models get more capable
  • ✕ Ignoring latency until it becomes a retention problem

Training built the models. Inference will determine who profits from them.

The AI companies that master inference economics will generate durable margins. Everyone else is building on a cost structure that gets worse as they grow.

Track AI infrastructure trends and investment themes at Value Add VC. Originally published in the Trace Cohen newsletter.

Frequently Asked Questions

Why is inference becoming more expensive than training for AI companies?

Training a model is a one-time cost, but inference runs continuously at scale — every user query, every API call, every automated workflow. At hundreds of millions of daily active users, inference compute compounds into costs that dwarf the original training bill by orders of magnitude.

What does the inference bottleneck mean for AI startups?

Startups building on top of third-party models face margin compression as query volume grows — inference costs scale with usage but pricing often doesn't. The winners will be those who optimize inference aggressively, build proprietary chips, or move to smaller specialized models that run cheaper at scale.

Which companies are best positioned to win the inference compute race?

NVIDIA dominates today with H100 and H200 GPUs, but custom silicon from Google (TPUs), AWS (Trainium/Inferentia), and startups like Groq (LPUs) and Cerebras are gaining ground. Whoever achieves the best tokens-per-second-per-dollar at scale wins the inference era.

How does inference economics affect the AI investment thesis?

VCs should scrutinize inference cost structure before writing checks — a startup with 80% gross margins at $1M ARR can see those margins collapse to 20% as they scale if inference spend is unchecked. Sustainable unit economics require either proprietary infrastructure, model distillation, or aggressive caching strategies.
