The Meta Llama 4 release answered a question the AI industry had been circling for two years: can open-weight models reach frontier quality? The answer is yes, and the implications go far beyond benchmark scores.
Scout runs a 10 million token context window, the largest of any production model at launch. Maverick hit 1417 on LM Arena, beating GPT-4o. Behemoth, at roughly 2 trillion total parameters, was still training when Meta shipped the others. This wasn't an "open source catch-up" story. It was Meta establishing capability parity while making the models free to download and self-host.
What the Llama 4 Models Actually Are
All three Llama 4 models use a Mixture of Experts (MoE) architecture, meaning only a fraction of total parameters activate per token, which makes them far more compute-efficient than dense models of similar capability.
| Model | Active Params | Total Params | Context Window | Experts |
|---|---|---|---|---|
| Llama 4 Scout | 17B | 109B | 10M tokens | 16 |
| Llama 4 Maverick | 17B | 400B | 1M tokens | 128 |
| Llama 4 Behemoth | ~288B | ~2T | TBD | 16 |
Source: Meta AI research blog, April 2025. Behemoth specifications are approximate; model was still in training at launch.
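To make the active-versus-total distinction concrete, here is a minimal Python sketch. The parameter counts are copied from the table above. The `route_top_k` function is a generic top-k gating illustration, not Meta's actual router, and the router logits are random placeholders.

```python
import math
import random

# Parameter counts in billions, taken from the table above (Behemoth is approximate).
MODELS = {
    "Llama 4 Scout": (17, 109),
    "Llama 4 Maverick": (17, 400),
    "Llama 4 Behemoth": (288, 2000),
}


def active_fraction(active_b, total_b):
    """Share of total parameters used to process a single token."""
    return active_b / total_b


def route_top_k(router_logits, k=1):
    """Generic top-k gating (an illustration, not Meta's router).

    Picks the k experts with the highest router scores and returns a
    softmax over just those experts as {expert_index: weight}.
    """
    ranked = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    peak = max(router_logits[i] for i in chosen)  # subtract the max for numerical stability
    weights = {i: math.exp(router_logits[i] - peak) for i in chosen}
    total = sum(weights.values())
    return {i: w / total for i, w in weights.items()}


if __name__ == "__main__":
    for name, (active_b, total_b) in MODELS.items():
        share = active_fraction(active_b, total_b)
        print(f"{name}: {share:.1%} of parameters active per token")

    # Placeholder router scores for a 16-expert layer; real scores come from a learned router.
    rng = random.Random(0)
    logits = [rng.gauss(0.0, 1.0) for _ in range(16)]
    print(route_top_k(logits, k=1))
```

By this arithmetic Scout touches about 16% of its weights per token, Maverick about 4%, and Behemoth about 14%. That is why a 400B-parameter model can have per-token compute closer to that of a 17B dense model, even though all 400B parameters still have to sit in memory.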
Why the Benchmark Numbers Matter (and Where They Don't)
Maverick's 1417 LM Arena score placed it above GPT-4o at launch, a remarkable result for an open-weight model. On MMLU-Pro (expert-level reasoning), Maverick scored 80.5% versus GPT-4o's 74.4%. On the MATH benchmark, it hit 93.2%. These are not cherry-picked edge cases. This is a generalist model matching closed frontier quality.
The caveat: benchmarks are a floor, not a ceiling. In internal testing at several engineering teams I've spoken with, Llama 4 Maverick's real-world coding performance lagged Claude Sonnet and GPT-4o on complex multi-file tasks. Instruction following and long-horizon agent reliability are areas where closed providers still have an edge in production. The gap is narrowing, but it exists.
Strong at launch
- MMLU-Pro: 80.5% (vs GPT-4o 74.4%)
- MATH: 93.2%
- Multilingual benchmarks
- LM Arena score: 1417 (above GPT-4o)
Still trails closed models
- Complex multi-file code generation
- Long-horizon agent reliability
- Instruction following edge cases
- Enterprise SLA / uptime guarantees
What the Meta Llama 4 Release Does to the AI Market
Open-weight frontier models don't just compete; they change the economics of the entire market. When you can download Maverick and self-host it with no per-token license fee, paying only for your own compute, the conversation about what OpenAI charges per million tokens for GPT-4o becomes unavoidable for any procurement team.
Enterprises
Self-hosting Llama 4 Scout or Maverick on AWS/GCP/Azure can cut AI inference costs 60 to 80% vs closed API pricing for high-volume workloads. Privacy-sensitive industries (healthcare, finance, legal) get to keep data on-premises without sacrificing capability. The procurement calculus has fundamentally changed.
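For teams that want to run this comparison on their own numbers, a rough back-of-the-envelope sketch in Python follows. Every figure in the example (token volume, blended API price, GPU count, hourly rate, ops overhead) is a hypothetical placeholder rather than a quote from any provider, and the function names are made up for illustration. It also assumes the chosen GPU fleet can actually serve the stated volume, which depends on throughput you would need to measure.

```python
def monthly_api_cost(tokens_millions, price_per_million):
    """Cost of sending the whole workload through a pay-per-token API."""
    return tokens_millions * price_per_million


def monthly_self_host_cost(gpu_count, gpu_hourly_rate, hours=730, ops_overhead=0.0):
    """Cost of renting GPUs around the clock, plus a flat ops/engineering overhead."""
    return gpu_count * gpu_hourly_rate * hours + ops_overhead


def savings_fraction(api_cost, self_host_cost):
    """Fraction of the API bill saved by self-hosting (negative if self-hosting costs more)."""
    return 1.0 - self_host_cost / api_cost


def break_even_tokens_millions(self_host_cost, price_per_million):
    """Monthly volume (in millions of tokens) above which self-hosting is cheaper."""
    return self_host_cost / price_per_million


if __name__ == "__main__":
    # All inputs below are hypothetical placeholders; substitute your own quotes.
    api = monthly_api_cost(tokens_millions=10_000, price_per_million=5.0)
    hosted = monthly_self_host_cost(gpu_count=8, gpu_hourly_rate=2.5, ops_overhead=3_000.0)
    print(f"API: ${api:,.0f}  self-hosted: ${hosted:,.0f}")
    print(f"Savings: {savings_fraction(api, hosted):.0%}")
    print(f"Break-even: {break_even_tokens_millions(hosted, 5.0):,.0f}M tokens/month")
```

The useful output is the break-even volume: below it, the fixed cost of idle GPUs makes the API cheaper, which is why the savings claim applies to high-volume workloads specifically.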
AI Startups Building on APIs
If your product is a layer on top of a closed model and Llama 4 can do the same job for free, you have a business model problem. The companies that survive this are the ones with proprietary data, vertical workflows, or distribution that extends beyond raw model access. Commodity capability is now the floor, not the ceiling.
OpenAI & Anthropic
Llama 4 doesn't kill them; it forces them to compete on ecosystem, reliability, and frontier-exclusive capabilities (reasoning models, deep tool use, agent frameworks). The $300B+ combined valuation of these labs rests on the assumption they can maintain capability leads. Open-weight parity compresses that window. See our AI Valuations dashboard for where these companies are priced today.
Meta
Meta isn't selling AI; it's spending $65B+ in 2025 capex to commoditize everyone else's AI. Open-sourcing competitive models undermines the moat of labs that depend on API revenue while Meta profits from AI making its ad products smarter and its platforms more engaging. This is a strategic subsidy, not a charity project.
The 10 Million Token Context Window Is a Product Decision, Not a Tech Demo
Scout's 10M-token context window is larger than anything GPT-4o, Claude Sonnet, or Gemini Pro offered at Llama 4's launch. To put it in practical terms: 10 million tokens is roughly 7,500 pages of text, or an entire mid-sized codebase, or years of customer support transcripts. In a single prompt.
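Page estimates like this depend heavily on the conversion factors you pick. The short Python sketch below makes those assumptions explicit. The 0.75 words-per-token ratio is a common rule of thumb for English text, not a property of Llama 4's tokenizer, and the words-per-page values are illustrative.

```python
def tokens_to_words(tokens, words_per_token=0.75):
    """Rough word count for a token budget (rule-of-thumb ratio for English)."""
    return tokens * words_per_token


def tokens_to_pages(tokens, words_per_token=0.75, words_per_page=500):
    """Rough page count; the answer scales inversely with words_per_page."""
    return tokens_to_words(tokens, words_per_token) / words_per_page


if __name__ == "__main__":
    context = 10_000_000  # Scout's advertised context window
    print(f"{tokens_to_words(context):,.0f} words")
    for wpp in (1000, 500):  # dense pages vs. typical manuscript pages (assumed values)
        print(f"{tokens_to_pages(context, words_per_page=wpp):,.0f} pages at {wpp} words/page")
```

Under these assumptions 10 million tokens is about 7.5 million words, which works out to 7,500 pages at a dense 1,000 words per page or 15,000 pages at 500. Code and non-English text tokenize less efficiently, so a real codebase will consume the budget faster than prose does.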
The traditional answer to "how do you work with more context than a model supports" was RAG (retrieval-augmented generation), where you chunk documents and retrieve relevant pieces. RAG works but introduces retrieval errors, chunking tradeoffs, and engineering complexity. A 10M-token context window doesn't eliminate RAG for all use cases, but it makes brute-force approaches viable for a much larger set of enterprise problems.
This matters especially for legal tech, healthcare documentation, financial analysis, and code review: verticals where the entire corpus needs to be in-context for accurate reasoning. Companies building in these spaces should be actively evaluating Scout-based architectures against their RAG implementations.
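The decision between stuffing everything into the prompt and falling back to retrieval can be sketched as a simple budget check. This is a minimal Python illustration under stated assumptions: `estimate_tokens` uses a crude four-characters-per-token heuristic instead of a real tokenizer, and `plan_context` and its parameters are hypothetical names, not part of any library.

```python
import math


def estimate_tokens(text, chars_per_token=4.0):
    """Crude token estimate; a real system would count with the model's tokenizer."""
    return math.ceil(len(text) / chars_per_token)


def plan_context(documents, context_limit_tokens, reserved_for_output=4096):
    """Choose between a single full-context prompt and a retrieval pipeline.

    Returns (strategy, estimated_tokens), where strategy is "full_context"
    if the whole corpus fits in the window after reserving room for the
    model's answer, and "retrieval" otherwise.
    """
    needed = sum(estimate_tokens(doc) for doc in documents)
    budget = context_limit_tokens - reserved_for_output
    strategy = "full_context" if needed <= budget else "retrieval"
    return strategy, needed


if __name__ == "__main__":
    corpus = ["contract text " * 2_000, "deposition transcript " * 5_000]
    print(plan_context(corpus, context_limit_tokens=128_000))
    print(plan_context(corpus, context_limit_tokens=10_000_000))
```

Fitting in the window is a necessary condition, not a sufficient one. Latency, per-request cost, and answer quality at very long context lengths still need to be measured before retiring a retrieval pipeline.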
Behemoth and What Comes Next
Llama 4 Behemoth, at approximately 2 trillion total parameters, was still training when Meta shipped Scout and Maverick. Meta positioned it as a "teacher model" used to improve the smaller models through distillation, while also being available as a standalone frontier model for the most demanding workloads.
At 2T parameters, Behemoth is in the weight class that the largest closed models, such as GPT-4 and Claude 3 Opus, are widely assumed to occupy (neither OpenAI nor Anthropic has published parameter counts). If it delivers proportional performance gains over Maverick and ships with a permissive enough license, it becomes the most capable open-weight model in history by a wide margin. That's a meaningful milestone for the market, not because of what Behemoth does on benchmarks, but because of what it signals about Meta's long-term commitment to keeping the frontier open.
The pattern here is clear. Meta released Llama 1 in February 2023, Llama 2 in July 2023, Llama 3 in April 2024, and Llama 4 in April 2025. Each generation narrows the gap with closed models. At this pace, the question for Llama 5 isn't whether it will match the frontier; it's whether it will lead it. Track how these dynamics are playing out in AI startup valuations and big tech earnings.
What This Means for Founders and Investors
Opportunities Created
- Fine-tuning businesses on open-weight models with proprietary data
- Self-hosted AI infrastructure for regulated industries
- Long-context applications that were previously cost-prohibitive
- Open-weight model tooling (serving, evals, monitoring, fine-tune pipelines)
- Vertical AI companies with proprietary training data
Businesses Under Pressure
- API wrappers with no data or workflow differentiation
- RAG-only businesses where long-context eliminates the problem
- Model providers without a clear capability lead narrative
- Enterprise AI tools built on high-cost closed APIs with no switching cost
- Anything priced on "we're using GPT-4" as a feature
Meta is not building an AI company. It's building a world where no one else can build a moat on top of an AI model alone.
The companies that win the AI era won't be the ones with the best base model; they'll be the ones with the best data, distribution, and workflow ownership.
Track AI company valuations and the open vs. closed model competition on the AI Valuations Dashboard at Value Add VC. Originally published in the Trace Cohen newsletter.