Grok 3 scores roughly 93% on AIME 2025 math and ~85% on GPQA Diamond science, and it has crossed 1400 on the LMSYS Arena Elo leaderboard, making it the first xAI model that genuinely competes with GPT-5 and Claude at the frontier. That's the short answer. The longer answer is more interesting.
xAI went from incorporation in 2023 to a frontier-tier model in under two years, faster than any lab in history. The benchmark wins are real. But there's a persistent gap between topping a leaderboard and being the model you actually reach for, and Grok 3 lives squarely inside that gap.
Grok 3 Review: Benchmark Results and What They Mean
Grok 3 is a frontier reasoning model from xAI that posts ~93% on AIME 2025 math, ~85% on GPQA Diamond science, and a 1400+ LMSYS Arena Elo, briefly topping that leaderboard at launch. Its "Think" and "Big Brain" modes trade latency for accuracy on hard problems. On math and reasoning it is class-leading; on coding and consistency it lands a half-step behind Claude and GPT-5.
The thing to understand about Grok 3 is that xAI optimized hard for the benchmarks that generate headlines (competition math, graduate science, and Arena Elo), and those are exactly the numbers where it posts its best results. That's not cheating; it's focus. But it does mean the launch-day "best model in the world" framing was narrower than it sounded.
Grok 3 vs GPT-5 vs Claude: Side-by-Side Comparison
Here's the head-to-head that matters. These are approximate, fast-moving figures; treat them as directional, not gospel, since every lab re-benchmarks constantly.
| Attribute | Grok 3 (xAI) | GPT-5 (OpenAI) | Claude (Anthropic) |
|---|---|---|---|
| AIME 2025 math | ~93% | ~92% | ~90% |
| GPQA Diamond science | ~85% | ~88% | ~84% |
| SWE-bench (coding) | ~68% | ~72% | ~78% |
| LMSYS Arena Elo | ~1400 | ~1420 | ~1410 |
| Context window | ~1M tokens | ~400K tokens | ~1M tokens |
| Consumer price | $30/mo (SuperGrok) | $20/mo (Plus) | $20/mo (Pro) |
| Real-time X data | Native | No | No |
| Content policy | Most permissive | Moderate | Most cautious |
The pattern is clear: Grok 3 wins on math, real-time data, and content permissiveness; GPT-5 edges it on science and consistency; Claude owns coding. No single model dominates, which is exactly why the price and policy columns matter more than the benchmark rows for most buyers.
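To make the "no single model dominates" reading concrete, here is a minimal Python sketch that picks the leader in each benchmark row and its margin over the runner-up. The numbers are the approximate figures from the table above; the `SCORES` layout and the `leaders` helper are illustrative names, not part of any published tooling.

```python
# Approximate figures copied from the comparison table; directional only.
SCORES = {
    "AIME 2025 math (%)": {"Grok 3": 93, "GPT-5": 92, "Claude": 90},
    "GPQA Diamond science (%)": {"Grok 3": 85, "GPT-5": 88, "Claude": 84},
    "SWE-bench coding (%)": {"Grok 3": 68, "GPT-5": 72, "Claude": 78},
    "LMSYS Arena Elo": {"Grok 3": 1400, "GPT-5": 1420, "Claude": 1410},
}


def leaders(scores):
    """Return {benchmark: (leading model, margin over the runner-up)}."""
    result = {}
    for bench, row in scores.items():
        # Sort models from highest to lowest score for this benchmark.
        ranked = sorted(row.items(), key=lambda kv: kv[1], reverse=True)
        (top_name, top_val), (_, second_val) = ranked[0], ranked[1]
        result[bench] = (top_name, top_val - second_val)
    return result


if __name__ == "__main__":
    for bench, (name, margin) in leaders(SCORES).items():
        print(f"{bench}: {name} leads by {margin}")
```

On these figures each model leads at least one row, and the math and science margins are only one to three points, small enough that a re-benchmark could plausibly reorder them.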
What Grok 3 Is Actually Good At
- Competition-grade math: ~93% AIME is genuinely best-in-class; the Think mode shines on multi-step proofs
- Real-time X analysis: native access to the live X firehose; no other frontier model has this
- DeepSearch research: agentic web + social search that pulls current, sourced results
- Permissive answers: far fewer refusals than Claude on edgy-but-legitimate prompts
The Compute Story: How xAI Caught Up So Fast
Grok 3 was trained on Colossus, xAI's Memphis supercluster, which scaled past 100,000 Nvidia H100 GPUs and is expanding toward 200,000+, making it one of the largest coherent training clusters on the planet. Building it in roughly 122 days became xAI's signature flex. The lesson is uncomfortable for incumbents: raw compute scale can compress a multi-year capability lead into months.
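For a sense of what that build pace implies, here is a back-of-the-envelope Python sketch. It assumes the roughly 122-day figure covers the first 100,000 GPUs, which the text suggests but does not break down, and it treats both numbers as exact; `average_daily_rate` is an illustrative helper.

```python
GPUS_INSTALLED = 100_000  # "scaled past 100,000" in the text; treated as exact (assumption)
BUILD_DAYS = 122          # "roughly 122 days" in the text; treated as exact (assumption)


def average_daily_rate(gpus, days):
    """Average number of GPUs brought online per day over the build window."""
    return gpus / days


per_day = average_daily_rate(GPUS_INSTALLED, BUILD_DAYS)
per_hour = per_day / 24

if __name__ == "__main__":
    print(f"About {per_day:.0f} GPUs per day, or {per_hour:.0f} per hour")
```

Under those assumptions the average works out to roughly 820 GPUs per day, or about 34 per hour, sustained around the clock for four months.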
It also explains the economics. xAI raised more than $6B in 2024 and pushed its valuation past $50B, then folded in X itself. That capital is mostly going into GPUs and power. For context on how the spending stacks up against Microsoft, Google, Meta, and Amazon, the Big Tech Earnings dashboard tracks the quarterly capex race, and the AI Valuations dashboard puts xAI's price tag next to OpenAI and Anthropic.
Grok 3 vs Claude and GPT-5: Where It Loses
Reach for Grok 3 when
- You need top-tier math or reasoning
- Real-time X / social data is core to the task
- You want fewer content refusals
- DeepSearch live research fits the workflow
Reach for Claude or GPT-5 when
- You're running production coding agents (Claude)
- You need a mature tool/plugin ecosystem (GPT-5)
- Enterprise compliance and audit trails matter
- Output consistency across runs is critical
The honest read: Grok 3 is a real frontier model, not a vanity project, but it is not yet the default. Claude's ~78% SWE-bench coding reliability and Anthropic's enterprise posture still win serious engineering work, and GPT-5's ecosystem maturity wins breadth. Grok 3 wins the specific niches above, and it has earned admission to the frontier club, which two years ago looked impossible for a new entrant.
Grok 3 proves the frontier is no longer a two-horse race.
It's not the best model at everything, but it's the best math-and-real-time model, built by the youngest lab, on the biggest cluster.
Compare frontier model economics on the AI Valuations dashboard at Value Add VC. Originally published in the Trace Cohen newsletter.