Everyone is racing to collect more data. The smartest companies figured out you don't have to.
Gartner estimates that by 2026, 60% of data used to train AI models will be synthetically generated. Meta acquired a ~$14.8B stake in Scale AI, a company whose core value prop is producing high-quality labeled and synthetic training data at industrial scale. OpenAI has used synthetic data pipelines to train GPT-4 and its successors. Waymo logs billions of simulated miles for every real mile its vehicles drive.
This isn't an academic footnote. It's the actual mechanism behind some of the best-funded AI bets of the last two years. And most founders are still treating data as a procurement problem instead of an engineering one.
What Synthetic Data Actually Is
Synthetic data is algorithmically generated data designed to replicate the statistical distributions of real-world data without containing any real records. It can be images, text, tabular data, time series, sensor readings, virtually any modality. The generation engine might be a GAN, a diffusion model, a rule-based simulator, or an LLM prompted to produce domain-specific examples.
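For intuition, here is a minimal sketch of the simplest kind of generation engine: fit each column's statistics on a small real table, then sample synthetic rows from those distributions. The column names and values are invented for illustration, and a production engine (a GAN, diffusion model, or prompted LLM) would also preserve cross-column correlations rather than sampling each column independently.

```python
# Minimal sketch: marginal-distribution synthesizer for tabular data.
# All column names and values are illustrative, not from the article.
import numpy as np

rng = np.random.default_rng(42)

# Pretend this is a small slice of real data: (age, income, is_fraud).
real = np.array([
    [34, 52_000, 0],
    [45, 81_000, 0],
    [29, 38_000, 0],
    [61, 97_000, 1],
    [38, 60_000, 0],
])

def synthesize(real_table, n_rows):
    """Sample synthetic rows by matching each column's mean/std (continuous)
    or empirical frequency (binary). No real record is copied."""
    cols = []
    for j in range(real_table.shape[1]):
        col = real_table[:, j]
        if set(np.unique(col)) <= {0, 1}:      # treat as a binary label
            cols.append(rng.binomial(1, col.mean(), size=n_rows))
        else:                                  # treat as continuous
            cols.append(rng.normal(col.mean(), col.std(), size=n_rows))
    return np.column_stack(cols)

synthetic = synthesize(real, n_rows=1_000)
print(synthetic[:3])  # 1,000 statistically similar but fully artificial rows
```

The contract is the same at any level of sophistication: learn the distribution, then sample records that never existed.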
The synthetic data market was valued at roughly $350M in 2023, and forecasts put it at $2.7B by 2030, driven almost entirely by enterprise AI adoption. That growth rate is not a coincidence: it tracks the explosion in model training spend and the recognition that "getting more real data" hits a wall fast in regulated or constrained domains.
Why Real Data Alone Is a Dead End
There are three problems with relying entirely on real-world data:
1. Scarcity of edge cases. The scenarios that break your model (the rare fraud pattern, the unusual medical scan, the edge-case sensor reading) are precisely the ones you have the least data on. You can't wait for reality to generate enough examples; you have to manufacture them. A minimal sketch of what that looks like follows this list.
2. Privacy and regulatory constraints. In healthcare, finance, and legal tech, the data you need most is often the data you're least allowed to use. HIPAA, GDPR, and CCPA don't care about your training pipeline. Synthetic data sidesteps the compliance problem entirely.
3. Labeling cost and speed. Human annotation is expensive, inconsistent, and slow. Labeling a medical imaging dataset at production quality runs $0.30–$1.50 per image; for datasets requiring 1M+ examples, that math stops working fast.
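To make the first point concrete, here is a minimal sketch of manufacturing edge cases with a rule-based simulator: generating transaction sequences that follow a hypothesized "card testing" fraud pattern (a burst of small probe charges followed by one large drain). The pattern, field names, and amounts are illustrative assumptions, not taken from any vendor's pipeline.

```python
# Minimal sketch: rule-based simulator that manufactures rare fraud-like
# transaction sequences. Pattern, fields, and amounts are illustrative.
import random
from datetime import datetime, timedelta

def simulate_card_testing(card_id, start):
    """Emit a hypothesized 'card testing' pattern: a burst of small probe
    charges minutes apart, then one large withdrawal."""
    txns = []
    t = start
    for _ in range(random.randint(3, 6)):            # small probe charges
        t += timedelta(minutes=random.randint(1, 5))
        txns.append({"card": card_id, "time": t,
                     "amount": round(random.uniform(0.5, 3.0), 2), "label": "fraud"})
    t += timedelta(minutes=random.randint(10, 60))    # the actual drain
    txns.append({"card": card_id, "time": t,
                 "amount": round(random.uniform(500, 2000), 2), "label": "fraud"})
    return txns

# Generate 10,000 sequences for a pattern that might surface only a handful
# of times per month in live data.
dataset = []
for i in range(10_000):
    dataset.extend(simulate_card_testing(f"card_{i}", datetime(2024, 1, 1)))

print(f"{len(dataset)} synthetic transactions, every one pre-labeled")
```

The side effect matters for point 3 as well: every synthetic record arrives pre-labeled, whereas annotating 1M real examples at even $0.30 each is a $300K line item.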
The Moat Is the Generation Engine
Here is where it gets interesting from an investment thesis perspective. Synthetic data is not just a cost-reduction tool; it's a compounding moat builder. The companies that build high-fidelity synthetic data pipelines early accumulate something competitors cannot easily replicate: a generation engine tuned to their specific domain.
Consider what this looks like in practice:
Autonomous Vehicles: Waymo & NVIDIA
Billions of simulated edge-case driving scenarios (narrow roads, unusual weather, pedestrian behavior) that would take decades to encounter naturally
Healthcare AI: PathAI, Viz.ai
Synthetic pathology and radiology images calibrated to match rare disease presentations that appear in fewer than 0.1% of real scans
Financial Fraud: Featurespace, Mastercard
Synthetic transaction sequences mimicking emerging fraud patterns before they appear at scale in live data
Robotics: Figure, 1X
Photorealistic simulation environments that let robots train on millions of manipulation tasks without a single real-world trial
What This Means for Founders Building AI Products
If you are building a vertical AI company and your data strategy is "collect more real-world examples," you are one regulatory change or competitive pivot away from a stalled roadmap. The founders who are winning treat data generation as a core engineering capability, not an ops function.
- Invest early in domain-specific generators. Even a basic rule-based augmentation pipeline compounds significantly over 12–24 months of model iteration; a minimal sketch follows this list.
- Use synthetic data to stress-test for rare events. Your model will be judged on its worst day, not its average. Engineer for the 0.1% scenarios before your customers find them.
- Build privacy-safe by default. Synthetic pipelines eliminate PII risk in training data, a non-trivial advantage when selling into regulated markets.
- Treat the generator as a product. The best AI companies (Scale, Waymo, Palantir) treat their data generation capability as a first-class product, not a cost center.
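As a starting point for the first item, here is a minimal sketch of a rule-based augmentation pipeline for labeled text; the transforms, lexicon, and example record are illustrative assumptions. Even a generator this simple, run on every training cycle, multiplies scarce labeled examples without new collection or annotation.

```python
# Minimal sketch of a rule-based text augmentation pipeline.
# Transforms, lexicon, and the example record are illustrative.
import random

# Hypothetical domain lexicon for a support-ticket classifier.
SYNONYMS = {
    "refund": ["chargeback", "money back", "reimbursement"],
    "broken": ["defective", "not working", "faulty"],
}

def swap_synonyms(text):
    """Replace known terms with a random domain synonym."""
    return " ".join(random.choice(SYNONYMS[w]) if w in SYNONYMS else w
                    for w in text.split())

def drop_word(text):
    """Simulate terse or noisy user input by dropping one word."""
    words = text.split()
    if len(words) > 3:
        words.pop(random.randrange(len(words)))
    return " ".join(words)

def augment(example, n_variants=5):
    """Produce label-preserving variants of one scarce real example."""
    variants = []
    for _ in range(n_variants):
        text = example["text"]
        for transform in random.sample([swap_synonyms, drop_word], k=2):
            text = transform(text)
        variants.append({"text": text, "label": example["label"]})
    return variants

seed = {"text": "product arrived broken and I want a refund", "label": "refund_request"}
for variant in augment(seed):
    print(variant)
```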
The AI companies with the deepest moats aren't sitting on the most data.
They're sitting on the best engines for generating it, and that gap widens every quarter.
Track the AI investment landscape at Value Add VC. Originally published in the Trace Cohen newsletter.