The venture world has been chasing text AI for three years. ChatGPT, Claude, Gemini — every demo, every pitch deck, every VC memo is about what you can do with a text box.
Meanwhile, business actually runs on voice. Sales happens on calls. Customer support happens on calls. The bulk of field operations, healthcare intake, insurance claims, financial advisory, and logistics coordination — all phone calls. Over 100 billion business calls are made in the United States every year. The global contact center industry burns through $400B+ annually paying humans to sit on those calls. Text AI is not touching this. Voice AI is.
The Scale Nobody Is Talking About
Let's put some numbers on the table that most AI investors are quietly ignoring:
Growing at ~5% CAGR, dominated by labor costs
Twilio alone processed trillions of call minutes in recent years
The single largest white-collar workforce on Earth
Includes labor, infrastructure, QA, training
Based on current voice AI platform pricing
Fireflies, Otter, Grain, Fathom fighting for share
Why Voice AI Is Technically Harder — and Why That Is a Moat
Voice AI is not text AI with audio. It is an entirely different class of problem, and the difficulty is exactly what creates sustainable competitive advantage.
Latency under 500ms
Conversations feel broken above that threshold. Text AI has no such constraint.
Accent and dialect handling
Enterprise deployments span global workforces. A model that fails on non-native speakers is useless at scale.
Real-time turn-taking
Detecting when someone is done speaking — including interruptions — is unsolved at the model level for most providers.
Background noise and audio quality
Call centers, warehouses, field ops — real business environments are not studio-quiet.
Emotional and prosodic context
The same words with different tone carry completely different meaning. Text AI loses this entirely.
Phone network codec degradation
PSTN calls are typically sampled at 8 kHz, which limits audio to a narrow band of roughly 300–3,400 Hz. Models trained on clean wideband audio often fail on real calls.
Every one of these is a hard engineering problem that takes 12-24 months to solve in production. That timeline, multiplied across use cases and industries, is exactly why voice AI companies that crack this early have defensible positions. You can't prompt-engineer your way to a good voice product.
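To make the latency constraint concrete, here is a minimal sketch of a per-turn latency budget. The stage names and millisecond figures are hypothetical assumptions chosen for illustration, not measurements from any vendor; only the 500 ms threshold comes from the text above.

```python
# Hypothetical per-stage latencies in milliseconds for one conversational turn.
# These numbers are illustrative assumptions, not measurements of any product.
STAGE_LATENCY_MS = {
    "telephony_network": 60,
    "speech_to_text": 150,
    "language_model_first_token": 200,
    "text_to_speech_first_audio": 120,
}

BUDGET_MS = 500  # the threshold cited in the text


def latency_headroom(stage_latency_ms, budget_ms=BUDGET_MS):
    """Return (total latency, remaining headroom); negative headroom means over budget."""
    total = sum(stage_latency_ms.values())
    return total, budget_ms - total


total_ms, headroom_ms = latency_headroom(STAGE_LATENCY_MS)
print(f"total={total_ms} ms, headroom={headroom_ms} ms")
```

With these assumed numbers the turn lands at 530 ms, 30 ms over budget, even though no single stage looks slow in isolation. That is the practical reason the whole pipeline has to be engineered together rather than one component at a time.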
The Economic Case Is Overwhelming
I have backed companies across 65+ investments. The ones that move fastest are the ones selling into existing, measurable spend. Voice AI is selling into the most measurable spend in business: labor.
A mid-size insurance company handling 500,000 inbound calls per month at 4 minutes average talk time generates about 24 million call minutes a year and spends $8M–$24M per year on call handling alone (roughly $0.33–$1.00 per minute), before quality assurance, workforce management, and training. Voice AI that handles 70% of those calls at $0.10/minute displaces $5.6M–$16.8M of that spend for about $1.7M in usage fees, a net saving of roughly $4M–$15M annually. The ROI case writes itself. No education required. No budget creation. Just pure displacement of existing line items.
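The insurance example above can be checked with a few lines of arithmetic. This is a rough sketch under two stated assumptions: automated calls are as long as the average call, and human spend falls in proportion to the share of calls automated. The function and parameter names are illustrative.

```python
def net_annual_savings(annual_human_spend, calls_per_month, minutes_per_call,
                       automated_share, ai_cost_per_minute):
    """Net yearly saving: displaced human spend minus what the AI minutes cost."""
    annual_minutes = calls_per_month * minutes_per_call * 12
    automated_minutes = annual_minutes * automated_share
    # Assumption: spend scales linearly with the share of calls handled by AI.
    displaced_spend = annual_human_spend * automated_share
    ai_spend = automated_minutes * ai_cost_per_minute
    return displaced_spend - ai_spend


# Low and high ends of the call-handling spend quoted in the text.
for spend in (8_000_000, 24_000_000):
    saving = net_annual_savings(spend, 500_000, 4, 0.70, 0.10)
    print(f"human spend ${spend:,}: net saving ${saving:,.0f}")
```

With these inputs the net saving comes out at about $3.9M for the $8M case and about $15.1M for the $24M case. The AI usage bill is about $1.7M in both cases, so the result is driven almost entirely by the human cost per minute.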
- Healthcare: patient intake, appointment scheduling, prior auth (all phone calls)
- Financial services: loan status, account inquiries, fraud disputes (80% routine)
- E-commerce: order tracking, return initiation, delivery issues (pure volume plays)
- Real estate: inbound lead qualification, where speed to answer is the entire game
- Field services: dispatch, scheduling, job updates (voice is the only interface that works)
Where the Real Opportunity Sits
The voice AI stack has three layers, and investors are systematically underweighting the middle and top:
Infrastructure (overcrowded)
- Speech-to-text: Deepgram, AssemblyAI, Whisper
- Text-to-speech: ElevenLabs, PlayHT, Cartesia
- Voice orchestration: Vapi, Bland AI, Retell
Vertical Application (where moats form)
- Healthcare: Abridge, Nabla (clinical documentation from voice)
- Sales: Gong, Chorus, Salesloft (call intelligence that feeds CRM)
- Legal: Clio, Lexis+ AI (meeting and deposition capture)
- Financial services: domain-specific call compliance and QA
Workflow Intelligence (the real prize)
- Voice-driven automation: not just transcription, but action
- Cross-call intelligence: surfacing patterns across thousands of calls
- Real-time agent assist: live coaching and compliance guardrails
- Voice as the input layer for autonomous agents
What This Means for Founders and Investors
If you are building in AI right now, the question is not whether to add a voice feature. The question is whether voice should be your primary interface. For any product that sells into industries where work happens on a phone — healthcare, financial services, logistics, real estate, field services — the answer is almost always yes.
The founders I find most compelling right now are the ones who have picked a specific vertical, understand the exact call flows that drive revenue or cost in that business, and are replacing those call flows with AI that takes action — not just transcribes. Transcription is table stakes. Workflow automation is the business.
For investors, voice AI companies are still dramatically undervalued relative to their text counterparts despite sitting on larger TAMs and higher switching costs. The technical barriers that scared investors two years ago have largely been cleared. The remaining moat is vertical depth and workflow ownership — which is exactly where I want to invest.
Text AI is the interface for the office. Voice AI is the interface for the world.
Business runs on phone calls. The companies that own the phone call will own the workflow — and that is a much bigger prize than anyone is pricing in right now.