Meta released Llama 4 on April 5, 2025 in three variants: Scout (17B active, 109B total parameters, a 10-million-token context window), Maverick (17B active, 400B total), and the still-training Behemoth (288B active, ~2T total). They are its first open-weight models built on a mixture-of-experts architecture.
That's the short answer. The longer answer is more interesting, because Llama 4 isn't just a bigger Llama. It's a structural break: Meta's first MoE design, its first natively multimodal family, and the first time an open-weight model shipped with a context window an order of magnitude larger than anything from OpenAI or Anthropic. Here's exactly what launched, how Maverick stacks up against GPT-4o, what it costs, and what it's actually good for.
Meta Llama 4 Release: What Shipped and When
The Meta Llama 4 release happened on April 5, 2025, when Meta released open weights for two models, Llama 4 Scout and Llama 4 Maverick, and previewed a third, Llama 4 Behemoth, that was still training. All three use a mixture-of-experts architecture and are natively multimodal, meaning they process text and images in one model. It was the biggest architectural change in the Llama line since the original 2023 launch.
The mixture-of-experts (MoE) design is the headline change. Instead of activating every parameter for every token like a dense model, an MoE model routes each token through a small subset of specialized "experts." Maverick has 400B total parameters across 128 experts but activates only 17B per token, so you get the knowledge of a 400B model at roughly the per-token compute cost of a 17B one (all 400B parameters still have to sit in memory). That economics shift is the whole point: it's how Meta delivers frontier quality at open-weight prices.
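To make the routing idea concrete, here is a toy sketch of top-k expert routing in plain Python. It is not Meta's router: the scalar "experts", the random router logits, and `top_k=1` are illustrative assumptions, and the function names are hypothetical. The only figures taken from the article are the 128 experts and the 17B-of-400B active share.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, top_k=1):
    """Pick the top_k experts for one token and renormalize their weights."""
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

def moe_layer(token, experts, router_logits, top_k=1):
    """Run only the selected experts and mix their outputs by router weight."""
    routes = route_token(router_logits, top_k)
    return sum(weight * experts[i](token) for i, weight in routes)

# Toy setup (assumption): 128 scalar "experts", each just scales its input.
NUM_EXPERTS = 128
experts = [lambda x, scale=i: x * (scale + 1) for i in range(NUM_EXPERTS)]

rng = random.Random(0)
logits = [rng.gauss(0, 1) for _ in range(NUM_EXPERTS)]  # stand-in for a learned router

routes = route_token(logits, top_k=1)
output = moe_layer(2.0, experts, logits, top_k=1)

# Share of Maverick's parameters touched per token, from the article's figures.
active_fraction = 17 / 400
print(f"experts run for this token: {len(routes)} of {NUM_EXPERTS}")
print(f"active parameter share: {active_fraction:.2%}")
```

The point of the sketch is the shape of the computation: 127 of the 128 toy experts never execute for this token, which is where the per-token compute saving comes from.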
Llama 4 also arrived against a backdrop of staggering spending: Meta's 2025 capex ran $65B to $72B, most of it AI infrastructure to train exactly these models. I broke down where that money goes in Meta's $65B AI capex piece. Llama 4 is the product that build was paying for.
The Three Llama 4 Models: Scout, Maverick, and Behemoth
Each variant targets a different point on the cost-capability curve. Scout is the efficient workhorse that fits on a single GPU. Maverick is the flagship general-purpose model. Behemoth is the frontier "teacher" that trained the other two. Here is how the specs compare.
| Model | Active Params | Total Params | Experts | Context | Status |
|---|---|---|---|---|---|
| Llama 4 Scout | 17B | 109B | 16 | 10M tokens | Released Apr 2025 |
| Llama 4 Maverick | 17B | 400B | 128 | 1M tokens | Released Apr 2025 |
| Llama 4 Behemoth | 288B | ~2T | 16 | Not disclosed | In training (teacher) |
| Llama 3.1 405B | 405B (dense) | 405B | n/a | 128K tokens | Prior gen (2024) |
| Llama 3.3 70B | 70B (dense) | 70B | n/a | 128K tokens | Prior gen (2024) |
| Llama 4 Scout (Int4) | 17B | 109B | 16 | 10M tokens | Fits on 1× H100 |
Figures are from Meta's April 2025 Llama 4 launch materials, the official model cards on llama.com and Hugging Face, and the Llama 3.x model cards for comparison. Behemoth specs are as disclosed at preview; the model had not been publicly released at the time of writing.
The single most striking number is Scout's 10-million-token context window, roughly 80x larger than Llama 3's 128K and bigger than GPT-4o (128K) or Claude (200K) at launch. And the Int4-quantized Scout fits on one Nvidia H100, which means a single ~$30,000 GPU can run a model with that context. That combination, long context plus single-GPU deployment, is what makes Scout genuinely novel rather than just another checkpoint.
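The single-GPU claim and the context ratio can be sanity-checked with back-of-the-envelope arithmetic. The sketch below counts weight memory only, assuming the 80 GB H100 variant and the usual bytes-per-parameter for each precision. It ignores the KV cache and activations, which grow with context length and add memory on top, so treat it as a lower bound rather than a deployment guide.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}
H100_MEMORY_GB = 80  # assumption: the 80 GB H100 variant

def weight_memory_gb(total_params_billion, precision):
    """Weights-only footprint: billions of params x bytes per param = GB."""
    return total_params_billion * BYTES_PER_PARAM[precision]

scout_bf16 = weight_memory_gb(109, "bf16")
scout_int4 = weight_memory_gb(109, "int4")
scout_fits_one_h100 = scout_int4 < H100_MEMORY_GB

# Scout's context window versus Llama 3's 128K.
context_ratio = 10_000_000 / 128_000

print(f"Scout bf16 weights: {scout_bf16:.1f} GB")
print(f"Scout int4 weights: {scout_int4:.1f} GB, fits on one H100: {scout_fits_one_h100}")
print(f"context ratio vs 128K: {context_ratio:.1f}x")
```

At Int4 the 109B parameters take about 54.5 GB, which leaves some headroom on an 80 GB card; at bf16 the same weights would need about 218 GB, which is why the quantized variant is the one that fits.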
Llama 4 Benchmarks vs GPT-4o and Gemini 2.0 Flash
On the benchmarks Meta published, Maverick is competitive with or ahead of GPT-4o and Gemini 2.0 Flash across reasoning, coding, and multimodal tasks, and it does it while activating a fraction of the parameters. An experimental chat-tuned version of Maverick posted an Elo score around 1417 on the LMArena human-preference leaderboard at launch, placing it near the top of all models. Here's the rough competitive picture.
| Model | Active Params | Open Weights? | Multimodal? | Positioning |
|---|---|---|---|---|
| Llama 4 Maverick | 17B | Yes | Yes (native) | Beats/matches GPT-4o on key tests |
| GPT-4o | Undisclosed | No | Yes | Closed, API-only flagship |
| Gemini 2.0 Flash | Undisclosed | No | Yes | Fast, cheap, closed |
| DeepSeek-V3 | 37B | Yes | No | Open MoE, strong reasoning |
| Llama 4 Scout | 17B | Yes | Yes (native) | Best-in-class long context |
| Claude 3.7 Sonnet | Undisclosed | No | Yes | Closed, top coding model |
Comparison blends Meta's published Llama 4 benchmark tables, LMArena leaderboard data, and each provider's model documentation. The LMArena Elo score reflected an experimental chat variant of Maverick, which Meta acknowledged differed from the released weights; treat head-to-head claims as directional, not audited.
One honest caveat: the LMArena score came from a specially tuned chat version, and Meta later clarified the released Maverick weights weren't identical to the model that posted that Elo score. That sparked legitimate criticism. The fair read is that Maverick is a genuinely strong GPT-4o-class model on most tasks, but the "#2 on the leaderboard" headline was oversold. You can see how the market values the labs building these models on our AI Valuations dashboard.
Llama 4 Pricing, Licensing, and How to Run It
The weights are free to download from llama.com and Hugging Face, and Llama 4 is built into Meta AI across WhatsApp, Messenger, and Instagram in 40+ countries. The catch is the license. The Llama 4 Community License is permissive for commercial use, with one exception: companies with more than 700 million monthly active users must request a separate license from Meta. That clause exists to keep Google, Amazon, and ByteDance from freely productizing Llama; for a startup or a mid-market enterprise it's irrelevant.
If you don't want to self-host, third-party inference providers like Together, Fireworks, Groq, and the major clouds serve Llama 4 by the token. Maverick lands in the rough range of $0.20 to $0.50 per million input tokens and $0.60 to $0.90 per million output tokens depending on provider, meaningfully cheaper than GPT-4o's roughly $2.50 (input) and $10 (output) per million. That price gap, not the benchmark bragging rights, is the real reason builders care: for high-volume workloads, an open GPT-4o-class model at a fifth of the cost changes the unit economics.
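A quick way to see what that gap means in dollars is to price a single workload at each rate. The sketch below uses the per-million-token prices quoted above; the workload itself (1 million calls a day, 1,000 input and 300 output tokens per call) is a made-up assumption, so the ratio will shift with your own input/output mix.

```python
def cost_usd(input_tokens, output_tokens, in_price_per_m, out_price_per_m):
    """Total cost given token counts and prices per million tokens."""
    return (input_tokens / 1e6) * in_price_per_m + (output_tokens / 1e6) * out_price_per_m

# Hypothetical daily workload (assumption, not from any provider).
CALLS_PER_DAY = 1_000_000
input_tokens = CALLS_PER_DAY * 1_000
output_tokens = CALLS_PER_DAY * 300

# Prices per million tokens as quoted in the article.
gpt4o_daily = cost_usd(input_tokens, output_tokens, 2.50, 10.00)
maverick_low = cost_usd(input_tokens, output_tokens, 0.20, 0.60)
maverick_high = cost_usd(input_tokens, output_tokens, 0.50, 0.90)

print(f"GPT-4o:          ${gpt4o_daily:,.0f} per day")
print(f"Maverick (low):  ${maverick_low:,.0f} per day")
print(f"Maverick (high): ${maverick_high:,.0f} per day")
print(f"GPT-4o / Maverick: {gpt4o_daily / maverick_high:.1f}x to {gpt4o_daily / maverick_low:.1f}x")
```

For this particular mix the gap comes out wider than 5x, because output tokens carry the largest price difference. A workload dominated by input tokens at the high end of Maverick's range would land closer to the 5x figure.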
Self-hosting math: Scout in Int4 runs on a single H100, while Maverick needs roughly an 8-GPU node (one H100 DGX-class server). For teams already paying for GPUs, the marginal cost per token approaches zero, which is the appeal of open weights for anyone running serious inference volume.
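For teams that would have to rent the hardware rather than reuse idle GPUs, the relevant question is the break-even throughput against a hosted API. The sketch below is one way to frame it. Every input is a placeholder: the $25/hour node price and the $0.50 per million blended API price are hypothetical, and real throughput depends heavily on batch size, context length, and serving stack.

```python
SECONDS_PER_HOUR = 3600

def self_host_cost_per_million(node_usd_per_hour, tokens_per_second, utilization=1.0):
    """Effective $ per million tokens for a rented node at a given sustained throughput."""
    tokens_per_hour = tokens_per_second * SECONDS_PER_HOUR * utilization
    return node_usd_per_hour / tokens_per_hour * 1e6

def break_even_tokens_per_second(node_usd_per_hour, api_usd_per_million):
    """Sustained throughput at which self-hosting matches the API price."""
    return node_usd_per_hour / api_usd_per_million * 1e6 / SECONDS_PER_HOUR

# Hypothetical inputs (assumptions for illustration only).
NODE_USD_PER_HOUR = 25.0     # rented 8-GPU node
API_USD_PER_MILLION = 0.50   # blended hosted price

needed_tps = break_even_tokens_per_second(NODE_USD_PER_HOUR, API_USD_PER_MILLION)
check = self_host_cost_per_million(NODE_USD_PER_HOUR, needed_tps)

print(f"break-even throughput: {needed_tps:,.0f} tokens/s")
print(f"cost at that throughput: ${check:.2f} per million tokens")
```

Below the break-even throughput the hosted API is cheaper; above it, self-hosting wins. With already-paid-for GPUs the node price is effectively sunk, which is the "approaches zero" case described above.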
Llama 4 Use Cases: Where the Open-Weight Model Actually Wins
The benchmark wars matter less than fit-for-purpose. Llama 4 wins decisively in three scenarios. First, data-sensitive deployments: banks, hospitals, and defense contractors that can't send data to a third-party API but can run open weights inside their own VPC. Second, high-volume inference, where the 5x cost gap versus GPT-4o compounds into real money at millions of calls a day. Third, long-context work: Scout's 10M-token window can ingest an entire codebase, a quarter's worth of legal filings, or a year of support tickets in one pass without a retrieval pipeline.
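Whether a given corpus actually fits in one pass is easy to estimate before committing to a long-context design. The sketch below uses a rough 4-characters-per-token rule of thumb, which is an assumption: real counts depend on the tokenizer and the language, so a production check should count with the model's own tokenizer. The function names and the output reserve are hypothetical.

```python
SCOUT_CONTEXT_TOKENS = 10_000_000
CHARS_PER_TOKEN = 4  # rough heuristic (assumption); varies by tokenizer and language

def estimate_tokens(texts):
    """Approximate token count from character length, rounded up."""
    total_chars = sum(len(t) for t in texts)
    return -(-total_chars // CHARS_PER_TOKEN)  # ceiling division

def fits_in_context(texts, context_tokens=SCOUT_CONTEXT_TOKENS, reserve_for_output=8_000):
    """True if the estimated prompt plus an output reserve fits the window."""
    return estimate_tokens(texts) + reserve_for_output <= context_tokens

# Synthetic corpora standing in for support tickets or source files.
small_corpus = ["x" * 4_000] * 1_000    # about 1M estimated tokens
large_corpus = ["x" * 5_000] * 10_000   # about 12.5M estimated tokens

print(estimate_tokens(small_corpus), fits_in_context(small_corpus))
print(estimate_tokens(large_corpus), fits_in_context(large_corpus))
```

If the estimate lands near or above the window, a retrieval step or chunking is still needed, even with a 10M-token model.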
Where Llama 4 is not the obvious pick: pure frontier reasoning and agentic coding, where Claude and OpenAI's reasoning models still lead, and consumer apps that just want the best out-of-the-box answer with zero infra. Llama 4 also shipped without a dedicated "reasoning" variant at launch, a gap competitors had already filled. With over 1 billion cumulative Llama downloads, though, the open ecosystem around it (fine-tunes, quantizations, tooling) is unmatched, and that network effect is its quiet moat.
For investors, the strategic point is that Llama 4 commoditizes the model layer. When a free open-weight model is GPT-4o-class, the value migrates to the application and infrastructure layers, which is exactly the bet our AI Valuations and AI Spending dashboards track.
The Bottom Line
Llama 4 made a free, open-weight, GPT-4o-class model available to anyone with a GPU, along with a 10M-token context window no closed lab could match at launch.
The leaderboard controversy was a self-inflicted wound, and Llama 4 still trails the best closed reasoning models. But that misses the point. Meta isn't selling tokens; it's commoditizing the layer its rivals charge for, funded by a $65B+ capex build and an ad business that prints cash. For builders, Maverick at a fifth of GPT-4o's price and Scout's 10M-token context are the most consequential open-source releases of the year. The model layer is becoming a utility, and Llama 4 is the clearest proof yet.
Track AI model valuations, big-tech capex, and infrastructure spend on the AI Valuations, Big Tech Earnings, and AI Spending dashboards at Value Add VC. Originally published in the Trace Cohen newsletter.