I've reviewed hundreds of AI company pitch decks. At least 70% of them have the same slide: "Our proprietary dataset creates an insurmountable moat." Almost none of them are right.
The data moat is the most abused concept in venture capital right now. It was a genuinely powerful idea in 2016. It is a mostly outdated idea in 2026. And founders who build their entire competitive thesis around it are walking into a trap.
How the Data Moat Became Gospel
The data moat thesis has legitimate historical roots. Between 2012 and 2019, collecting labeled training data at scale was genuinely hard. Storage was expensive. Labeling was manual and costly; Scale AI charged $0.04 to $0.20 per label. Building the infrastructure to store, clean, and version a 10TB dataset required a dedicated ML platform team.
Google, Facebook, and Amazon built trillion-dollar businesses on data network effects. The lesson was clear: whoever collects the most data wins. That inference was reasonable for the infrastructure era. It does not apply to the foundation model era.
GPT-4 was trained on roughly 13 trillion tokens, effectively the entire indexed internet plus licensed book and code repositories. The "proprietary" web dataset your team spent six months scraping was almost certainly in that training set. The information asymmetry that made raw data collection valuable evaporated the moment frontier models were trained on everything.
Three Forces That Killed the Static Data Moat
Synthetic Data Generation
Gartner projects synthetic data will account for 60% of AI training data by 2028, up from under 5% in 2021. If a competitor can generate 100 million synthetic examples of your domain for $50,000, your 2 million hand-labeled examples are not a moat; they're a head start measured in months.
Fine-Tuning Cost Collapse
In 2021, fine-tuning a frontier-class model on a custom domain required millions of dollars in compute. In 2026, fine-tuning a production-grade open-source model on 10,000 domain-specific examples costs under $2,000 and takes hours. Your competitor can match your model's performance six months after you ship.
Data Marketplace Commoditization
Scale AI, Appen, Toloka, and dozens of vertical-specific data marketplaces now sell labeled training datasets across healthcare, legal, finance, logistics, and manufacturing. If your "proprietary" data looks like something you could license, your moat has already been priced into the market.
What Actually Creates a Defensible Data Advantage
Real data moats exist. They just look nothing like a static dataset sitting in an S3 bucket. Here's what I look for across my 65+ investments when a founder claims a data advantage:
- Feedback loop data: Data generated when users act on your model's predictions (clicks, conversions, corrections, escalations). Every user interaction updates the model. Churn literally degrades performance, which means the data and the product are the same asset.
- Real-time operational data: Data that is only valuable because of its recency: today's pricing signals, live logistics route optimization, real-time fraud patterns. This cannot be replicated from a historical dataset because staleness destroys the value.
- Privileged access data: Data with legal or contractual exclusivity: EHR records under a HIPAA Business Associate Agreement, satellite imagery under an exclusive government contract, proprietary financial transaction streams. If accessing it requires years of legal work, it's genuinely hard to replicate.
- Rare domain edge case accumulation: Data that only a scaled market leader generates: fraud patterns visible only after processing $50B in transactions, rare equipment failure signatures visible only after monitoring 100,000 industrial sensors for five years. You cannot buy this and you cannot synthesize it accurately.
- Structured interaction logs from real expert workflows: Multi-turn decision traces from licensed professionals: how an experienced radiologist navigates a complex scan, how a senior underwriter prices a non-standard risk. This data requires both domain expertise and production deployment to generate.
The Companies Getting It Right
The best AI data moats I've seen are not databases; they are flywheels. Toast processes payments and inventory for over 120,000 restaurants. The moat is not that they have restaurant transaction records. The moat is that 120,000 operators are constantly teaching the system what works in their specific context: menu pricing elasticity in suburban New Jersey versus downtown Manhattan, staffing patterns during NFL Sundays, supplier price fluctuations for proteins.
Palantir's actual moat is similar. The dataset is not the advantage; the operator ontology, the accumulated schema of how human experts make decisions inside complex organizations, is the advantage. That structure was built by watching thousands of analysts work inside their software for a decade.
In both cases, the data got better because users kept using the product. The moment usage stopped, the data advantage would begin decaying. That decay mechanism is not a bug; it's exactly what makes the moat real. A competitor can replicate your dataset. They cannot replicate your installed base generating fresh signal every day.
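One way to see why the decay mechanism matters is to model a data sample's value as halving every fixed number of days. The half-life figure below is an illustrative assumption, not a measured number from Toast or Palantir; the point is only that a static dataset's worth falls geometrically, while an active installed base replenishes it continuously.

```python
def data_value(age_days: float, half_life_days: float = 30.0) -> float:
    """Fraction of a sample's original value remaining after age_days,
    assuming value halves every half_life_days (illustrative assumption)."""
    return 0.5 ** (age_days / half_life_days)
```

Under a 30-day half-life, a copied dataset retains an eighth of its value after 90 days. A competitor who replicates your S3 bucket starts that clock at the moment of the copy; a flywheel resets it every day.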
The Right Questions to Ask About Data Defensibility
The data moat you're pitching is probably not a moat. Before the next pitch, ask three things: Does the data regenerate with usage, or was it a one-time capture? Does its value decay without a live installed base? Could a competitor buy, license, or synthesize something equivalent within a year?
A real moat is a feedback loop where every customer action makes your model better and every churned customer makes it worse. That's the flywheel. Everything else is just a pile of files with a timer running on them.
Tracking AI investment trends at Value Add VC. Originally published in the Trace Cohen newsletter.