AI & Technology · May 4, 2026 · 8 min read

Inference Is the New Compute: The Next AI Bottleneck

Everyone obsessed over training costs. The next trillion-dollar constraint in AI is inference — and whoever controls it sets the economics for the entire industry.

Trace Cohen
3x founder, 65+ investments, building Value Add VC

Quick Answer

Inference — running AI models in production to serve real queries — is rapidly eclipsing training as the primary compute bottleneck. As model usage scales from millions to billions of queries per day, inference costs dwarf training costs by 10x or more, reshaping the competitive landscape for chips, cloud providers, and AI startups alike.

The AI industry spent 2022–2024 obsessing over training compute. The next $500B constraint is inference — and most people are still not paying attention.

Training a frontier model like GPT-4 costs an estimated $50M–$100M in compute. That is a large number that makes headlines. But OpenAI alone was reportedly spending over $700,000 per day on inference in mid-2023 — before ChatGPT scaled to 100 million weekly actives. Run those numbers forward two years and the inference bill dwarfs training by an order of magnitude. The bottleneck shifted and most investors have not updated their mental model.

Training vs. Inference: The Numbers That Matter

The confusion between training and inference costs is one of the most persistent mistakes in AI analysis. Training is a capital expense — you pay once to build the model. Inference is an operating expense that scales with every query. At consumer scale, inference wins the cost race by a landslide.

  • $50M–$100M: GPT-4 estimated training cost (a one-time compute expense)

  • $700K+/day: OpenAI inference spend, 2023 estimate (scales with every query)

  • $250M+/yr: annual inference bill at scale (before Sora and the o-series reasoning models)

  • ~70–80%: inference share of total AI compute as models reach production scale
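
To make the shift concrete, here is a back-of-envelope version of the comparison using the rough figures above; the usage-growth multiple is an illustrative assumption, not a reported number.

```python
# Back-of-envelope: one-time training capex vs. recurring inference opex.
# Training and per-day inference figures are the rough public estimates
# cited above; the usage growth multiple is an illustrative assumption.

TRAINING_COST = 100e6            # upper end of the GPT-4 training estimate (one-time)
INFERENCE_PER_DAY_2023 = 700e3   # reported ~$700K/day inference spend in 2023
USAGE_GROWTH = 3                 # assumed usage multiple as the product scales

annual_inference_2023 = INFERENCE_PER_DAY_2023 * 365
annual_inference_scaled = annual_inference_2023 * USAGE_GROWTH

print(f"One-time training cost:     ${TRAINING_COST / 1e6:,.0f}M")
print(f"Inference at 2023 run rate: ${annual_inference_2023 / 1e6:,.0f}M/yr")
print(f"Inference at {USAGE_GROWTH}x usage:       ${annual_inference_scaled / 1e6:,.0f}M/yr")

# At the 2023 run rate alone, the recurring inference bill passes the
# one-time training bill in under five months.
```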

Why Inference Is Structurally Harder Than Training

Training is a batched, predictable workload. You spin up 10,000 GPUs, run a job for weeks, and you are done. Inference is latency-sensitive, bursty, and economically brutal at scale. Every millisecond of latency is a product problem. Every wasted GPU cycle is margin destruction.

  • Latency requirements are brutal

    Consumer AI products need sub-200ms responses. Reasoning models like o3, which chain multiple inference passes per query, burn 10–50x more compute than simple completions, at exactly the moment users expect the fastest experience.

  • Load is unpredictable

    Training runs on a fixed schedule. Inference spikes when a tweet goes viral, when a product launches, when a news event breaks. Overprovisioning costs money. Underprovisioning destroys retention. Neither is acceptable at scale (a rough capacity sketch follows this list).

  • Model size is the wrong lever

    The naive solution — smaller models — trades capability for cost. The smarter solution is model distillation, speculative decoding, quantization, and caching. These are engineering investments most startups are not making.

  • Specialized chips are winning

    NVIDIA GPUs were designed for training. Groq's LPUs, Google's TPUs, and AWS Inferentia chips are purpose-built for inference throughput — and they are showing 3–10x cost advantages in production workloads.
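
A rough capacity sketch of that overprovision-or-underprovision bind; the traffic profile, per-GPU throughput, and GPU pricing here are all illustrative assumptions.

```python
# Rough capacity planning for bursty inference load. All numbers
# (QPS, per-GPU throughput, GPU cost) are illustrative assumptions.

AVG_QPS       = 200      # steady-state queries per second (assumed)
PEAK_QPS      = 1_000    # viral-moment peak (assumed)
QPS_PER_GPU   = 10       # serving throughput per GPU (assumed)
GPU_COST_HOUR = 2.50     # cost per GPU-hour (assumed)

gpus_for_avg  = AVG_QPS // QPS_PER_GPU    # 20 GPUs: cheap, drops traffic on spikes
gpus_for_peak = PEAK_QPS // QPS_PER_GPU   # 100 GPUs: survives the spike, mostly idle

def monthly_cost(gpus: int) -> float:
    return gpus * GPU_COST_HOUR * 24 * 30

print(f"Provision for average: {gpus_for_avg} GPUs, ${monthly_cost(gpus_for_avg):,.0f}/mo")
print(f"Provision for peak:    {gpus_for_peak} GPUs, ${monthly_cost(gpus_for_peak):,.0f}/mo, "
      f"~{AVG_QPS / PEAK_QPS:.0%} utilized most of the time")
```

The spread between those two bills is what the caching, routing, and distillation work later in this piece is trying to claw back.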

The Chip War Has a New Battlefield

NVIDIA built its dominance on training. H100 clusters fill data centers because they are the best available option for large-scale training runs. But inference is a different competitive game — and the incumbents are not guaranteed to win it.

  • NVIDIA, H100 / H200 / B200. Strength: training dominance, massive installed base. Weakness: not optimized for inference latency vs. throughput.

  • Google, TPU v5e / v5p. Strength: inference-optimized, powers Gemini at scale. Weakness: only accessible through Google Cloud.

  • AWS, Trainium2 / Inferentia2. Strength: deep cloud integration, competitive pricing. Weakness: still catching up on software ecosystem.

  • Groq, Language Processing Unit (LPU). Strength: 500+ tokens/sec per chip, lowest latency available. Weakness: capital-intensive, limited scale vs. hyperscalers.

  • Cerebras, WSE-3 (Wafer Scale Engine). Strength: fastest throughput for large models. Weakness: narrow addressable market, not a broad solution.

What This Means for Founders and Investors

I have sat across from dozens of AI founders in the last 18 months who could tell me exactly what it cost to fine-tune their model but had no idea what it would cost to serve 1 million daily active users. That is an existential blind spot.

The unit economics of AI products are determined at inference time, not training time. A startup with a 70% gross margin at $500K ARR can watch that margin evaporate to 15% at $5M ARR if inference spend grows faster than revenue, which is exactly what happens when usage per customer and compute per query both climb under flat pricing. The companies that survive will be those that treat inference optimization as a core engineering competency from day one.
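
The arithmetic behind that collapse is simple. A toy version, with every number assumed for illustration rather than taken from a real company:

```python
# Toy illustration of inference-driven margin compression under flat pricing.
# All figures are assumptions chosen to mirror the scenario above.

def gross_margin(arr: float, inference_cost: float) -> float:
    """Gross margin with inference as the dominant cost of goods sold."""
    return (arr - inference_cost) / arr

# Early: $500K ARR against a $150K/yr inference bill.
print(f"{gross_margin(500_000, 150_000):.0%}")      # 70%

# Later: revenue grows 10x, but queries per user and tokens per query grow
# faster, so the inference bill grows ~28x while pricing stays flat.
print(f"{gross_margin(5_000_000, 4_250_000):.0%}")  # 15%
```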

Inference-Aware Strategies That Work

  • ✓ Model distillation — smaller models for common queries
  • ✓ Aggressive semantic caching of repeated requests
  • ✓ Speculative decoding to reduce latency at low cost
  • ✓ Usage-based pricing tied to actual compute consumed
  • ✓ Routing queries to the cheapest model that meets quality bar (sketched below)
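
A minimal sketch of two of those levers together, caching repeated requests and routing each query to the cheapest model that clears the quality bar; the model names, per-query costs, and quality scores are hypothetical placeholders, not real pricing or benchmarks.

```python
# Minimal sketch: cache repeated requests, route the rest to the cheapest
# model that meets the quality bar. Model names, costs, and quality scores
# are hypothetical placeholders.
from dataclasses import dataclass

@dataclass
class Model:
    name: str
    cost_per_query: float   # assumed relative cost, not real pricing
    quality: float          # assumed benchmark score in [0, 1]

MODELS = [
    Model("small-distilled", 0.001, 0.72),
    Model("mid-tier",        0.010, 0.85),
    Model("frontier",        0.060, 0.95),
]

cache: dict[str, str] = {}  # exact-match stand-in for a semantic cache

def call_model(model: Model, query: str) -> str:
    """Placeholder for the real inference call."""
    return f"[{model.name}] answer to: {query}"

def answer(query: str, quality_bar: float) -> str:
    if query in cache:                      # repeated request: pay nothing
        return cache[query]
    eligible = [m for m in MODELS if m.quality >= quality_bar]
    model = min(eligible, key=lambda m: m.cost_per_query) if eligible else MODELS[-1]
    cache[query] = call_model(model, query)
    return cache[query]

print(answer("summarize this contract", quality_bar=0.80))  # routed to mid-tier
print(answer("summarize this contract", quality_bar=0.80))  # served from cache
```

The quality bar is the real design decision: set it per workload rather than globally, or the router quietly sends everything to the frontier model anyway.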

Patterns That Blow Up at Scale

  • ✕ Sending every query to the largest frontier model
  • ✕ Flat subscription pricing with unlimited API calls
  • ✕ No caching layer — paying full price for identical queries
  • ✕ Assuming inference costs stay constant as models get more capable
  • ✕ Ignoring latency until it becomes a retention problem

Training built the models. Inference will determine who profits from them.

The AI companies that master inference economics will generate durable margins. Everyone else is building on a cost structure that gets worse as they grow.

Track AI infrastructure trends and investment themes at Value Add VC. Originally published in the Trace Cohen newsletter.

Frequently Asked Questions

Why is inference becoming more expensive than training for AI companies?

Training a model is a one-time cost, but inference runs continuously at scale — every user query, every API call, every automated workflow. At hundreds of millions of daily active users, inference compute compounds into costs that dwarf the original training bill by orders of magnitude.

What does the inference bottleneck mean for AI startups?

Startups building on top of third-party models face margin compression as query volume grows — inference costs scale with usage but pricing often doesn't. The winners will be those who optimize inference aggressively, build proprietary chips, or move to smaller specialized models that run cheaper at scale.

Which companies are best positioned to win the inference compute race?

NVIDIA dominates today with H100 and H200 GPUs, but custom silicon from Google (TPUs), AWS (Trainium/Inferentia), and startups like Groq (LPUs) and Cerebras are gaining ground. Whoever achieves the best tokens-per-second-per-dollar at scale wins the inference era.

How does inference economics affect the AI investment thesis?

VCs should scrutinize inference cost structure before writing checks — a startup with 80% gross margins at $1M ARR can see those margins collapse to 20% as they scale if inference spend is unchecked. Sustainable unit economics require either proprietary infrastructure, model distillation, or aggressive caching strategies.
