Every major AI lab has demonstrated multimodal capabilities that should have changed everything by now. They haven't. GPT-4o can see, hear, and respond in real time. Gemini Ultra processes a one-hour video in seconds. Claude reads medical imaging. None of this has produced a product people actually use to do their jobs differently.
Where We Are in Multimodal AI
The data is stark. As of Q1 2026, over $80B has been invested in foundation model companies with multimodal capabilities — Google DeepMind, OpenAI, Anthropic, Mistral, and dozens of specialized players. Yet consumer daily active users of multimodal features remain below 15% of all AI users, according to a16z's State of AI report. The capability gap closed fast: GPT-4V launched in late 2023; by 2024 every major model processed images and audio; by early 2026, video understanding was table stakes.
But "can" and "does" are different things. The capability exists. The products built on top of it are underwhelming.
I've watched this pattern across 65+ investments: a breakthrough capability emerges, demos go viral, and then two years pass with no dominant application. The capability is real. The product design hasn't caught up. Multimodal AI is deep in that phase right now — and it means the opportunity window is wide open for the founders who figure it out first.
The Demo-to-Default Gap
When OpenAI showed GPT-4o in live demos, people gasped. It translated menus in real time, solved handwritten math, described rooms for the visually impaired. Those demos were real. The daily product usage was not — because multimodal AI requires a fundamentally different UX paradigm than text-in, text-out.
Users don't naturally think "let me share my screen with an AI" or "let me describe this by voice while showing this image." The input methods are awkward. The products are built around text chat boxes with an "attach image" button tacked on. That is not a multimodal product. That is a text product with image support. Adding multimodal to a chat interface is like adding a phone number to a messaging app — it's technically there, but nobody uses it that way.
The real multimodal killer app will not look like a chat interface. It will be designed from the ground up around the assumption that the AI can see and hear — and the experience breaks if it cannot.
Five Reasons Nobody Has Built It Yet
- Persistent visual context is missing: today's products treat every image upload as a new conversation. The killer app needs to remember what it saw last week and connect it to what you're photographing today.
- Voice is a button, not the interface: real multimodal workflows require voice as the primary input — not a microphone icon you have to tap. The UX needs to be designed around speaking first, typing second.
- Cross-modal memory doesn't exist yet: models can't reliably link what they heard last Tuesday to what they're reading today to what they see in a photo right now. That cross-modal continuity is the foundation of a genuinely useful product (a rough sketch of what such a memory record could look like follows this list).
- Mobile-first design is ignored: the camera and microphone are on your phone, not your laptop. Most multimodal products are built desktop-first with a mobile app as an afterthought. The killer app will be born on mobile.
- No deep enterprise workflow integration: the app has to hook into where work actually lives — Salesforce, Epic, Slack, AutoCAD, SAP. A standalone multimodal AI that doesn't write to the system of record is a demo, not a product.
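What would that persistence and cross-modal linkage look like at the data level? Here is a minimal sketch, assuming Python and a deliberately simplified data model; the Observation class, its fields, and the link helper are illustrative assumptions of mine, not a description of any shipping product.

```python
# Hypothetical sketch of a cross-modal memory record: one capture that ties
# together what the model saw, what it heard, when it happened, which earlier
# captures it relates to, and where the result should land in the system of record.
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Observation:
    observation_id: str                       # stable ID so later captures can refer back to this one
    captured_at: datetime                     # when the worker captured it
    image_ref: str | None = None              # pointer to the photo or video frame, if any
    audio_transcript: str | None = None       # voice note, transcribed to text
    extracted_facts: list[str] = field(default_factory=list)      # the model's structured findings
    linked_observations: list[str] = field(default_factory=list)  # IDs of earlier, related captures
    system_of_record: str | None = None       # e.g. "epic", "procore", "sap" (illustrative names)
    record_id: str | None = None              # the record this observation wrote back to


def link(current: Observation, earlier: Observation) -> None:
    """Connect today's capture to an earlier one so context persists across sessions."""
    current.linked_observations.append(earlier.observation_id)


# Example: last week's site photo, linked to today's voice-logged follow-up.
last_week = Observation("obs-001", datetime(2026, 1, 5),
                        image_ref="s3://site-42/slab-crack.jpg")
today = Observation("obs-002", datetime(2026, 1, 12),
                    audio_transcript="Crack has widened since last visit; flag for structural review.")
link(today, last_week)
```

The point of the sketch is the linkage, not the schema: persistent visual context, cross-modal memory, and write-back to the system of record are all properties of how these records connect over time, which is exactly what a chat transcript with attached images does not give you.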
Where the Real Opportunity Lives
The sectors closest to a genuine multimodal breakthrough are not consumer social apps. They are B2B tools where the workflow naturally combines seeing and talking — and where the economics of solving the problem are massive.
Healthcare is the biggest: a radiologist has two screens of imaging and dictates notes by voice. Combining visual interpretation of labs, imaging, and clinical notes in a single voice-and-vision workflow could save 2–4 hours per clinician per day — at $150+ per hour, that's a $40B+ productivity unlock in the US alone. Construction and engineering is close behind: a field engineer doing site inspections shouldn't type — they should walk, look, and talk while the AI logs findings, flags deviations from blueprints, and updates the project management system. Manufacturing is a near-term winner: defect detection via vision combined with voice-logged quality records is already being piloted at scale, and the first company to make it turnkey at $50K/year per facility will print money.
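The healthcare number is easy to sanity-check against the article's own inputs. A back-of-envelope sketch in Python, assuming the $40B is an annual figure and roughly 250 clinical working days per year (both assumptions of mine, not claims from the piece):

```python
# Back-of-envelope check on the healthcare figure, using the numbers in the
# paragraph above. The annual framing and working days per year are assumptions.

hours_saved_per_day = 3           # midpoint of the stated 2-4 hour range
hourly_value = 150                # dollars per clinician-hour, per the article
working_days_per_year = 250       # assumed clinical working days

annual_value_per_clinician = hours_saved_per_day * hourly_value * working_days_per_year
target_unlock = 40e9              # the $40B+ productivity unlock

clinicians_implied = target_unlock / annual_value_per_clinician
print(f"Annual value per clinician: ${annual_value_per_clinician:,.0f}")   # ~$112,500
print(f"Clinicians implied by $40B: {clinicians_implied:,.0f}")            # ~356,000
```

At the midpoint of the stated range, the claim implies on the order of 350,000 clinicians capturing that time savings, a fraction of the US clinician workforce rather than all of it.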
The pattern in all three cases: the worker is physically doing something, cannot type, needs to capture what they see and hear, and needs that data to flow into existing systems without extra steps. That is the multimodal killer app. It has nothing to do with a chat interface.
The multimodal killer app isn't a general assistant that can see. It's a vertical tool that can't function without seeing. When you remove the camera, the product breaks — that's how you know you've built it.
Stay current with VC and startup trends at Value Add VC. Originally published in the Trace Cohen newsletter.