AI & Technology · June 27, 2026 · 10 min read · Last updated: June 27, 2026

Meta Llama 4 Release: 3 Models, 10M-Token Context, and How Maverick Benchmarks vs GPT-4o

Llama 4 is Meta's first mixture-of-experts, natively multimodal model family: three variants, a 10-million-token context window, and benchmarks that put Maverick in the same tier as GPT-4o, all shipped as free open weights. Here's what's actually new and what to use it for.

Trace Cohen
Co-Founder & GP at Six Point Ventures · 3x founder (BrandYourself, Launch.it, SPOT) · 65+ investments · Based in Boca Raton, FL
@Trace_Cohen · t@nyvp.com · South Florida Advisory

Quick Answer

Meta announced three Llama 4 variants on April 5, 2025: Scout (17B active/109B total params, 10M-token context) and Maverick (17B active/400B total), both released that day, plus the still-training Behemoth (288B active/~2T total). On Meta's published benchmarks, Maverick matches or beats GPT-4o and Gemini 2.0 Flash as a free open-weight model.

Meta announced Llama 4 on April 5, 2025 in three variants: Scout (17B active, 109B total params, a 10-million-token context window), Maverick (17B active, 400B total), and the still-training Behemoth (288B active, ~2T total). They are its first open-weight models built on a mixture-of-experts architecture.

That's the short answer. The longer answer is more interesting, because Llama 4 isn't just a bigger Llama. It's a structural break: Meta's first MoE design, its first natively multimodal family, and the first time an open-weight model shipped with a context window an order of magnitude larger than anything from OpenAI or Anthropic. Here's exactly what launched, how Maverick stacks up against GPT-4o, what it costs, and what it's actually good for.

Meta Llama 4 Release: What Shipped and When

The Meta Llama 4 release happened on April 5, 2025, when Meta released open weights for two models, Llama 4 Scout and Llama 4 Maverick, and previewed a third, Llama 4 Behemoth, that was still training. All three use a mixture-of-experts architecture and are natively multimodal, meaning they process text and images in one model. It was the biggest architectural change in the Llama line since the original 2023 launch.

The mixture-of-experts (MoE) design is the headline change. Instead of activating every parameter for every token like a dense model, an MoE model routes each token through a small subset of specialized "experts." Maverick has 400B total parameters across 128 experts but activates only 17B per token, so you get the knowledge of a 400B model at roughly the per-token compute cost of a 17B one (all 400B parameters still have to sit in memory). That shift in economics is the whole point: it's how Meta delivers frontier quality at open-weight prices.
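To make the active-versus-total distinction concrete, here is a minimal sketch of top-k expert routing in plain Python. It is a toy, not Meta's implementation: the router scores, the four scalar "experts," and the names route_token and moe_forward are all hypothetical, and a real MoE layer works on tensors with a learned router. Only the 17B and 400B figures come from the article.

```python
import math

def softmax(scores):
    """Convert raw router scores into probabilities."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_scores, top_k=1):
    """Return (expert_index, weight) pairs for the top_k highest-scoring experts."""
    probs = softmax(router_scores)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    norm = sum(probs[i] for i in chosen)  # renormalize over the chosen experts
    return [(i, probs[i] / norm) for i in chosen]

def moe_forward(x, experts, router_scores, top_k=1):
    """Run only the selected experts and mix their outputs by router weight."""
    return sum(w * experts[i](x) for i, w in route_token(router_scores, top_k))

# Four toy "experts"; each is just a scalar function here (hypothetical).
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x ** 2]
scores = [0.1, 2.5, 0.3, 1.0]  # hypothetical router output for one token

selected = route_token(scores, top_k=1)
output = moe_forward(3.0, experts, scores, top_k=1)

# Share of Maverick's parameters used per token, from the article's figures.
active_fraction = 17e9 / 400e9
print(selected, output, f"{active_fraction:.2%}")
```

Only the expert at index 1 runs for this token; the other three are never called. That is the mechanism behind Maverick using about 4% of its parameters per token.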

Llama 4 also arrived against a backdrop of staggering spending: Meta's 2025 capex ran $65–72B, most of it AI infrastructure to train exactly these models. I broke down where that money goes in Meta's $65B AI capex piece. Llama 4 is the product that build was paying for.

The Three Llama 4 Models: Scout, Maverick, and Behemoth

Each variant targets a different point on the cost-capability curve. Scout is the efficient workhorse that fits on a single GPU when quantized to Int4. Maverick is the flagship general-purpose model. Behemoth is the frontier "teacher" used to distill the other two. Here is how the specs compare.

| Model | Active Params | Total Params | Experts | Context | Status |
| --- | --- | --- | --- | --- | --- |
| Llama 4 Scout | 17B | 109B | 16 | 10M tokens | Released Apr 2025 |
| Llama 4 Maverick | 17B | 400B | 128 | 1M tokens | Released Apr 2025 |
| Llama 4 Behemoth | 288B | ~2T | 16 | Not disclosed | In training (teacher) |
| Llama 3.1 405B | 405B (dense) | 405B | n/a | 128K tokens | Prior gen (2024) |
| Llama 3.3 70B | 70B (dense) | 70B | n/a | 128K tokens | Prior gen (2024) |
| Llama 4 Scout (Int4) | 17B | 109B | 16 | 10M tokens | Fits on 1× H100 |

Figures are from Meta's April 2025 Llama 4 launch materials, the official model cards on llama.com and Hugging Face, and the Llama 3.x model cards for comparison. Behemoth specs are as disclosed at preview; the model had not been publicly released at the time of writing.

The single most striking number is Scout's 10-million-token context window, roughly 80x larger than Llama 3's 128K and bigger than GPT-4o (128K) or Claude (200K) at launch. And the Int4-quantized Scout fits on one Nvidia H100, which means a single ~$30,000 GPU can run a model that supports that context. That combination of long context plus single-GPU deployment is what makes Scout genuinely novel rather than just another checkpoint.
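The "roughly 80x" figure is easy to check. The sketch below uses the context sizes quoted above and treats "K" as 1,000 and "M" as 1,000,000, which is a simplifying assumption (model cards often list powers of two, which would shift the ratios slightly). The dictionary and function names are made up for illustration.

```python
# Context window sizes as quoted in the article, in tokens.
# Assumption: "128K" is read as 128,000 rather than 131,072.
CONTEXT_WINDOWS = {
    "Llama 4 Scout": 10_000_000,
    "Llama 4 Maverick": 1_000_000,
    "Claude (at launch)": 200_000,
    "GPT-4o": 128_000,
    "Llama 3": 128_000,
}

def context_ratio(model, baseline, windows=CONTEXT_WINDOWS):
    """How many times larger `model`'s window is than `baseline`'s."""
    return windows[model] / windows[baseline]

for name in ("Llama 3", "GPT-4o", "Claude (at launch)"):
    ratio = context_ratio("Llama 4 Scout", name)
    print(f"Scout vs {name}: {ratio:.1f}x")
```

Under these assumptions Scout's window comes out about 78x Llama 3's and 50x Claude's, which is where the "order of magnitude" framing comes from.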

Llama 4 Benchmarks vs GPT-4o and Gemini 2.0 Flash

On the benchmarks Meta published, Maverick is competitive with or ahead of GPT-4o and Gemini 2.0 Flash across reasoning, coding, and multimodal tasks, and it does so while activating only a fraction of its total parameters. An experimental chat-tuned version of Maverick posted an Elo around 1417 on the LMArena human-preference leaderboard at launch, placing it near the top of all models. Here's the rough competitive picture.

| Model | Active Params | Open Weights? | Multimodal? | Positioning |
| --- | --- | --- | --- | --- |
| Llama 4 Maverick | 17B | Yes | Yes (native) | Beats/matches GPT-4o on key tests |
| GPT-4o | Undisclosed | No | Yes | Closed, API-only flagship |
| Gemini 2.0 Flash | Undisclosed | No | Yes | Fast, cheap, closed |
| DeepSeek v3 | 37B | Yes | No | Open MoE, strong reasoning |
| Llama 4 Scout | 17B | Yes | Yes (native) | Best-in-class long context |
| Claude Sonnet 3.7 | Undisclosed | No | Yes | Closed, top coding model |

Comparison blends Meta's published Llama 4 benchmark tables, LMArena leaderboard data, and each provider's model documentation. The LMArena Elo reflected an experimental chat variant of Maverick, which Meta acknowledged differed from the released weights; treat head-to-head claims as directional, not audited.

One honest caveat: the LMArena score came from a specially tuned chat version, and Meta later clarified the released Maverick weights weren't identical to the model that posted that Elo. That sparked legitimate criticism. The fair read is that Maverick is a genuinely strong GPT-4o-class model on most tasks, but the "#2 on the leaderboard" headline was oversold. You can see how the market values the labs building these models on our AI Valuations dashboard.
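For a sense of what a rating like 1417 means in practice, the standard Elo expected-score formula turns a rating gap into a head-to-head preference rate. This is a sketch under assumptions: the rival rating of 1380 is invented for illustration, not a real leaderboard entry, and LMArena's own fitting procedure may differ from textbook Elo.

```python
def elo_expected_score(rating_a, rating_b):
    """Standard Elo expected score for A against B, between 0 and 1."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

maverick_experimental = 1417  # figure quoted in the article
hypothetical_rival = 1380     # illustrative only, not a real leaderboard score

p = elo_expected_score(maverick_experimental, hypothetical_rival)
print(f"Expected preference rate: {p:.3f}")
```

A 37-point gap works out to a preference rate of a little over 55%, so even a visible lead on the leaderboard is a narrow edge in any single comparison. That is one more reason to read the rankings as directional.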

Llama 4 Pricing, Licensing, and How to Run It

The weights are free to download from llama.com and Hugging Face, and Llama 4 is built into Meta AI across WhatsApp, Messenger, and Instagram in 40+ countries. The catch is the license. The Llama 4 Community License is permissive for commercial use, with one exception: companies with more than 700 million monthly active users must request a separate license from Meta. That clause exists to keep Google, Amazon, and ByteDance from freely productizing Llama; for a startup or a mid-market enterprise it's irrelevant.

If you don't want to self-host, third-party inference providers like Together, Fireworks, Groq, and the major clouds serve Llama 4 by the token. Maverick lands in the rough range of $0.20–0.50 per million input tokens and $0.60–0.90 per million output tokens depending on provider, meaningfully cheaper than GPT-4o's roughly $2.50/$10 per million. That price gap, not the benchmark bragging rights, is the real reason builders care: for high-volume workloads, an open GPT-4o-class model at a fifth of the cost changes the unit economics.
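Here is a quick way to run that comparison for your own traffic. The per-million-token prices are the approximate ranges quoted above; the workload of 2B input and 500M output tokens a month is a made-up example, and monthly_cost is just a helper name for this sketch.

```python
def monthly_cost(input_tokens, output_tokens, input_price, output_price):
    """Cost in dollars, given token counts and prices per million tokens."""
    return (input_tokens / 1e6) * input_price + (output_tokens / 1e6) * output_price

# Hypothetical workload: 2B input and 500M output tokens per month.
workload = {"input_tokens": 2_000_000_000, "output_tokens": 500_000_000}

# Approximate per-million prices quoted in the article (provider-dependent).
maverick_low = monthly_cost(**workload, input_price=0.20, output_price=0.60)
maverick_high = monthly_cost(**workload, input_price=0.50, output_price=0.90)
gpt4o = monthly_cost(**workload, input_price=2.50, output_price=10.00)

print(f"Maverick: ${maverick_low:,.0f} to ${maverick_high:,.0f}")
print(f"GPT-4o:   ${gpt4o:,.0f}")
print(f"Ratio:    {gpt4o / maverick_high:.1f}x to {gpt4o / maverick_low:.1f}x")
```

On this particular mix the gap is wider than the "fifth of the cost" shorthand, roughly 7x to 14x, because output tokens carry the biggest price difference. Your ratio will move with your input/output split and your provider.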

Self-hosting math: Scout in Int4 runs on a single H100, while Maverick needs roughly an 8-GPU node (one H100 DGX-class server). For teams already paying for GPUs, the marginal cost per token approaches zero, which is the appeal of open weights for anyone running serious inference volume.
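The back-of-envelope version of that math looks like this. It counts memory for the weights only, assuming an 80 GB H100 and the usual bytes-per-parameter for each precision; it ignores the KV cache, activations, and framework overhead, which add substantially at long context. The function names are illustrative.

```python
import math

H100_MEMORY_GB = 80  # assumption: the 80 GB H100 variant

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}

def weight_memory_gb(total_params, precision):
    """Approximate memory for weights alone, in GB (1 GB = 1e9 bytes)."""
    return total_params * BYTES_PER_PARAM[precision] / 1e9

def min_gpus_for_weights(total_params, precision, gpu_gb=H100_MEMORY_GB):
    """Smallest GPU count whose combined memory holds the weights."""
    return math.ceil(weight_memory_gb(total_params, precision) / gpu_gb)

# Total parameter counts from the article: Scout 109B, Maverick 400B.
for label, params, precision in [
    ("Scout int4", 109e9, "int4"),
    ("Maverick fp8", 400e9, "fp8"),
    ("Maverick bf16", 400e9, "bf16"),
]:
    gb = weight_memory_gb(params, precision)
    gpus = min_gpus_for_weights(params, precision)
    print(f"{label}: {gb:.1f} GB of weights, at least {gpus} GPU(s)")
```

Scout's Int4 weights come to about 54.5 GB, which is why one H100 is enough. Maverick's weights need 400 GB at FP8, so an 8-GPU node (640 GB) holds them with headroom, while BF16 weights (800 GB) would not fit on a single such node.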

Llama 4 Use Cases: Where the Open-Weight Model Actually Wins

The benchmark wars matter less than fit-for-purpose. Llama 4 wins decisively in three scenarios. First, data-sensitive deployments: banks, hospitals, and defense contractors that can't send data to a third-party API but can run open weights inside their own VPC. Second, high-volume inference where the 5x cost gap versus GPT-4o compounds into real money at millions of calls a day. Third, long-context work: Scout's 10M-token window can ingest an entire codebase, a quarter of legal filings, or a year of support tickets in one pass without a retrieval pipeline.
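Before betting on the "one pass, no retrieval" approach, it is worth estimating whether your corpus actually fits. The sketch below uses a crude rule of thumb of about 4 characters per token, which is an assumption and varies by tokenizer and content; for a real decision you would count with the model's own tokenizer. The function names, the file extensions, and the 10% reserve for prompt and output are all illustrative choices.

```python
import os

CHARS_PER_TOKEN = 4  # rough heuristic; an assumption, not a tokenizer

def estimate_tokens(text, chars_per_token=CHARS_PER_TOKEN):
    """Very rough token estimate from character count."""
    return len(text) // chars_per_token

def estimate_repo_tokens(root, extensions=(".py", ".md", ".txt")):
    """Sum estimated tokens over matching files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(extensions):
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="ignore") as handle:
                    total += estimate_tokens(handle.read())
    return total

def fits_in_context(token_count, window=10_000_000, reserve=0.1):
    """True if token_count fits while leaving `reserve` of the window free."""
    return token_count <= window * (1 - reserve)

# Example with an in-memory string standing in for a document.
sample = "def handler(event):\n    return event\n" * 1000
tokens = estimate_tokens(sample)
print(tokens, fits_in_context(tokens))
```

To check a real project, call estimate_repo_tokens on its root directory and pass the result to fits_in_context. If the estimate lands anywhere near the limit, treat it as a no and keep a retrieval step.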

Where Llama 4 is not the obvious pick: pure frontier reasoning and agentic coding, where Claude and OpenAI's reasoning models still lead, and consumer apps that just want the best out-of-the-box answer with zero infra. Llama 4 also shipped without a dedicated "reasoning" variant at launch, a gap competitors had already filled. With over 1 billion cumulative Llama downloads, though, the open ecosystem around it (fine-tunes, quantizations, tooling) is unmatched, and that network effect is its quiet moat.

For investors, the strategic point is that Llama 4 commoditizes the model layer. When a free open-weight model is GPT-4o-class, the value migrates to the application and infrastructure layers, exactly the bet our AI Valuations and AI Spending dashboards track.

The Bottom Line

Llama 4 made a free, open-weight, GPT-4o-class model available to anyone with a GPU, along with a 10M-token context window no closed lab could match at launch.

The leaderboard controversy was a self-inflicted wound, and Llama 4 still trails the best closed reasoning models. But that misses the point. Meta isn't selling tokens; it's commoditizing the layer its rivals charge for, funded by a $65B+ capex build and an ad business that prints cash. For builders, Maverick at a fifth of GPT-4o's price and Scout's 10M-token context are the most consequential open-weight releases of the year. The model layer is becoming a utility, and Llama 4 is the clearest proof yet.

Track AI model valuations, big-tech capex, and infrastructure spend on the AI Valuations, Big Tech Earnings, and AI Spending dashboards at Value Add VC. Originally published in the Trace Cohen newsletter.


Frequently Asked Questions

When was Meta Llama 4 released?

Meta released Llama 4 on April 5, 2025, launching two models immediately, Llama 4 Scout and Llama 4 Maverick, with a third and far larger model, Llama 4 Behemoth, still in training and previewed but not shipped. It was Meta's first model family built on a mixture-of-experts (MoE) architecture and its first to be natively multimodal, handling text and images in a single model.

How does Llama 4 Maverick compare to GPT-4o?

Llama 4 Maverick (17B active, 400B total parameters) matches or beats GPT-4o and Gemini 2.0 Flash on several reasoning, coding, and multimodal benchmarks while activating only 17B parameters per token, which makes inference cheaper. An experimental chat version scored an Elo of roughly 1417 on the LMArena leaderboard at launch. The key difference is that Maverick ships as open weights, while GPT-4o is closed and API-only.

What is the Llama 4 context window?

Llama 4 Scout supports a context window of up to 10 million tokens, the largest of any major model at its release and roughly 80x larger than Llama 3's 128K window. Maverick supports up to 1 million tokens. A 10M-token window can hold millions of words at once, enabling whole-codebase analysis, multi-document summarization, and long-running agent memory without external retrieval systems.

Is Llama 4 free to use?

Yes, Llama 4 is released under the Llama 4 Community License, which makes the model weights free to download and use commercially. The main restriction is that companies with more than 700 million monthly active users must request a separate license from Meta, a clause aimed squarely at competitors like Google and Amazon. For nearly every startup and enterprise, Llama 4 is effectively free to self-host.

What is Llama 4 Behemoth?

Llama 4 Behemoth is Meta's largest model, with 288 billion active parameters and roughly 2 trillion total parameters across 16 experts. It was still in training at the April 2025 launch and was used as a 'teacher' model to distill Scout and Maverick. Meta claims Behemoth outperforms GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on several STEM benchmarks, though it has not been publicly released.


Trace Cohen is a serial founder, investor, and data geek. Please feel free to reach out at t@nyvp.com.
