GPT-4 scored in the 90th percentile on the US bar exam. AI systems are outperforming average radiologists on mammography classification. Frontier coding models solve over 80% of the problems on competitive programming benchmarks. The question is no longer whether AI can match human experts; it's whether benchmarks are measuring the right thing at all.
Where the Data Actually Shows AI Winning
The benchmark results are real, and founders and investors who dismiss them as hype are making a mistake. In 2023, a Stanford study found that GPT-4 outperformed physicians on clinical vignette questions, scoring 87% vs. a 73% average for board-certified doctors. Google's Med-PaLM 2 reached 86.5% on the US Medical Licensing Exam, above the passing threshold and in the range of expert physicians.
In legal: Harvey AI, built on GPT-4, completed first-pass contract review at 80% accuracy on standard commercial agreements, matching junior associate performance at roughly 1/100th the cost per document. In code: DeepMind's AlphaCode 2 ranked in the top 15% of competitive programmers on Codeforces. OpenAI's o3 scored 87.5% on ARC-AGI, a benchmark explicitly designed to resist pattern memorization.
I've seen this firsthand across portfolio companies. One legal tech startup deploying AI for contract redline analysis saw the AI flag 94% of the material issues that junior lawyers caught, and flag 11% more issues that the lawyers had missed entirely. That's not a marginal improvement. That's a structural change in how legal work gets done.
The Domains Where AI Has Crossed the Expert Threshold
- Radiology (narrow classification): AI achieves sensitivity of 94.5% vs. 88.0% for radiologists on diabetic retinopathy screening (Google Health, 2019 Lancet study). The key word is "narrow": single-modality, single-condition classification tasks.
- Legal contract review: AI models identify standard clause deviations with 85-92% precision on templated commercial contracts, matching or exceeding the 82% baseline accuracy measured for first-year associates by Thomson Reuters research.
- Standardized exams: GPT-4 passes the bar (90th percentile), USMLE (passing threshold), CFA Level 1 (estimated 75%+ correct), and the CPA exam (estimated 78% correct). These are real credentials, not toy benchmarks.
- Software engineering: GitHub Copilot users complete tasks 55.8% faster in Microsoft's own study. Claude and GPT-4 solve 60-80% of LeetCode hard problems correctly on the first attempt, roughly mid-level engineer performance on algorithmic challenges.
- Customer support (Tier 1): Klarna's AI assistant handled the work of 700 FTEs in its first month, resolving 2.3 million conversations at 78% customer satisfaction vs. 82% for human agents. The 4-point gap is shrinking every quarter.
Why Benchmarks Are a Lie, and What to Measure Instead
Here's the inconvenient reality: benchmark performance and production reliability are two fundamentally different things, and conflating them is one of the most expensive mistakes an enterprise can make in 2026.
The gap between benchmark accuracy and real-world production accuracy is typically 15-30 percentage points in high-stakes domains. Why? Benchmarks are static, curated datasets, often with answers leaking into the training distribution, so models can effectively memorize adjacent patterns. Real-world inputs are messy, ambiguous, and adversarial in ways benchmarks don't capture.
I watched a healthcare AI company pitch 94% accuracy on a published benchmark, then deploy to a hospital system and hit 71% on live patient data. The 23-point drop wasn't a failure of the model; it was a failure of benchmark design. The benchmark didn't account for imaging equipment variation, patient demographic shifts, or rare condition distributions in that specific hospital's population.
The right metrics for AI deployment are precision and recall on your specific data distribution, error-mode analysis (what types of mistakes does it make, and are they recoverable?), and performance degradation on edge-case inputs. Benchmark leaderboard position tells you nothing about any of these.
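To make that concrete, here's a minimal sketch of what an evaluation on your own production distribution might look like. This is an illustration, not any particular vendor's methodology: the predict() callable, the field names (input, label, is_edge_case), and the positive label are hypothetical placeholders for whatever your stack actually records.

```python
# Minimal sketch: score a model on a labeled sample of production cases,
# broken out by error mode and edge-case slice. All field names are assumed.
from collections import Counter

def evaluate(cases, predict, positive="flag"):
    """cases: iterable of dicts with 'input', 'label', and an 'is_edge_case' tag."""
    tp = fp = fn = tn = 0
    error_modes = Counter()
    slices = {"all": [], "edge": []}

    for case in cases:
        pred = predict(case["input"])          # your model call goes here
        truth = case["label"]
        correct = pred == truth
        slices["all"].append(correct)
        if case.get("is_edge_case"):
            slices["edge"].append(correct)

        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive and truth != positive:
            fp += 1
            error_modes[f"false_{positive}:{truth}"] += 1   # what it over-calls
        elif pred != positive and truth == positive:
            fn += 1
            error_modes[f"missed_{positive}:{pred}"] += 1   # what it misses
        else:
            tn += 1

    acc = lambda xs: sum(xs) / len(xs) if xs else float("nan")
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy_all": acc(slices["all"]),
        "accuracy_edge_cases": acc(slices["edge"]),   # degradation shows up here
        "top_error_modes": error_modes.most_common(5),
    }
```

Run something like this on a few hundred labeled production cases, with edge cases tagged by your own domain experts, and you learn more than any leaderboard position will tell you.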
The Investment Implication: Where Human-AI Teams Win
The companies building durable businesses aren't replacing experts โ they're building human-AI workflows that multiply expert throughput by 3-10x. That's where the real value is, and the data backs it up.
McKinsey's 2025 State of AI report found that companies using AI as an expert augmentation layer, rather than a replacement, saw 28% higher productivity gains and 40% lower error rates than companies deploying AI autonomously on high-stakes tasks. The radiologist who reviews AI-flagged cases processes 4x the volume. The lawyer who delegates first-pass contract review to AI focuses on strategy and negotiation. The engineer who hands boilerplate to Copilot ships it twice as fast and spends the reclaimed time on architecture decisions.
From an investing lens: I'm most interested in companies that have instrumented this productivity multiplier into their core product metric and can demonstrate it with real customer data, not benchmark claims. Startups in legal tech, medical imaging, financial analysis, and code generation that show 3x expert throughput with maintained or improved accuracy on production data are the ones that will build durable category positions.
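For teams that want to put a number on that multiplier, here is one rough way the instrumentation could look. It assumes the product logs per-expert usage periods with a hypothetical schema (tasks completed, hours spent, error rate, and whether the AI layer was on); none of this reflects any specific company's telemetry.

```python
# Sketch of computing an expert-throughput multiplier from assumed usage logs:
# tasks per expert-hour with the AI layer vs. without, alongside the quality delta.
from dataclasses import dataclass

@dataclass
class UsagePeriod:
    expert_id: str
    assisted: bool      # True if the expert worked with the AI layer
    tasks: int          # units of expert work completed (contracts, reads, PRs)
    hours: float        # expert hours spent
    error_rate: float   # share of tasks later flagged as wrong or reworked

def throughput_multiplier(periods):
    def rate(rows):
        hours = sum(r.hours for r in rows)
        return sum(r.tasks for r in rows) / hours if hours else 0.0

    def avg_error(rows):
        total = sum(r.tasks for r in rows)
        return sum(r.error_rate * r.tasks for r in rows) / total if total else 0.0

    assisted = [p for p in periods if p.assisted]
    baseline = [p for p in periods if not p.assisted]
    base_rate = rate(baseline)
    return {
        "multiplier": rate(assisted) / base_rate if base_rate else float("nan"),
        "error_rate_assisted": avg_error(assisted),
        "error_rate_baseline": avg_error(baseline),
    }
```

The point of the quality fields is that a throughput multiplier only counts if the error rate holds steady or improves alongside it.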
The question was never "AI vs. human." The question is: which companies are building AI that makes their best humans 10x more productive โ and can prove it in production, not on a benchmark leaderboard?
Stay current with VC and startup trends at Value Add VC. Originally published in the Trace Cohen newsletter.