A New Agentic-Memory Framework Uses 118K Tokens Per Query -- LangMem Burns 3.26M

A new agentic-memory framework reportedly answers queries using about 118,000 tokens, versus roughly 3.26 million for the widely used LangMem approach -- a nearly 28x reduction in token consumption. As AI agents proliferate, how efficiently they remember and retrieve context is becoming a decisive driver of cost, latency and viability.

New Framework: ~118K tokens / query
LangMem: ~3.26M tokens / query
Reduction: ~28x fewer tokens
Domain: Agentic memory
Trace Cohen
Early-stage VC & angel · Founder, New York Venture Partners
June 26, 2026
2 min read
KEY TAKEAWAYS FOR VCs & FOUNDERS

1. Token efficiency is becoming the make-or-break economic variable for production AI agents
2. A ~28x cut in tokens per query directly slashes inference cost and latency
3. Memory architecture is emerging as a real competitive battleground in agent infrastructure
4. It is the unglamorous engineering that separates demos from deployable systems

The VC Read · Trace's Take

Token efficiency is the boring metric that decides which agent companies actually make money. A 28x reduction in tokens per query isn't a benchmark flex -- it's the gap between a viable gross margin and a product that bleeds cash every time a user hits enter. As agents move to persistent, multi-step work, memory architecture becomes a real moat, not plumbing. This is the same lesson as the inference-chip gold rush: the AI business is won on cost-to-serve. Reproduce the numbers before believing them -- but if they hold, efficient memory becomes table stakes fast, and the bloated frameworks get repriced.

A newly detailed agentic-memory framework reportedly handles queries using roughly 118,000 tokens, compared with about 3.26 million tokens for the popular LangMem approach -- a nearly 28-fold reduction in token consumption, according to VentureBeat. In a world where every token has a price and adds latency, that gap is the difference between an agent that is economically viable at scale and one that is not.
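The headline ratio is easy to check from the two reported figures. The short sketch below only does that arithmetic; the function names are illustrative, and the token counts are the reported numbers, not independently verified.

```python
# Reported tokens per query (from the article; not independently verified).
NEW_FRAMEWORK_TOKENS = 118_000
LANGMEM_TOKENS = 3_260_000


def reduction_factor(baseline_tokens: int, new_tokens: int) -> float:
    """How many times fewer tokens the new approach uses than the baseline."""
    return baseline_tokens / new_tokens


def percent_saved(baseline_tokens: int, new_tokens: int) -> float:
    """Share of the baseline's tokens that are no longer spent, in percent."""
    return (1 - new_tokens / baseline_tokens) * 100


factor = reduction_factor(LANGMEM_TOKENS, NEW_FRAMEWORK_TOKENS)
saved = percent_saved(LANGMEM_TOKENS, NEW_FRAMEWORK_TOKENS)
print(f"{factor:.1f}x fewer tokens, {saved:.1f}% saved per query")
```

On the reported figures this works out to about 27.6x, which is where the "nearly 28x" framing comes from.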

The problem it addresses is becoming central to building real AI agents. Agents need memory -- the ability to recall prior context, facts and interactions across long-running tasks -- but naive approaches balloon the amount of text fed into the model on every step, driving up cost and slowing responses. As agents take on multi-step, persistent workflows, inefficient memory can make them prohibitively expensive, turning an impressive demo into an untenable product.
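To see why naive memory balloons, consider a toy model, not the framework's actual method: an agent that re-sends its entire history on every step, versus one that sends only a fixed budget of retrieved context. The step count, tokens per step and budget below are hypothetical values chosen purely for illustration.

```python
def naive_total_tokens(steps: int, tokens_per_step: int) -> int:
    """Total input tokens when every step re-sends the full history so far."""
    return sum(k * tokens_per_step for k in range(1, steps + 1))


def bounded_total_tokens(steps: int, tokens_per_step: int, budget_steps: int) -> int:
    """Total input tokens when each step sends at most `budget_steps` worth of context."""
    return sum(min(k, budget_steps) * tokens_per_step for k in range(1, steps + 1))


# Hypothetical workload: 100 steps, 500 tokens added per step, budget of 5 steps.
naive = naive_total_tokens(steps=100, tokens_per_step=500)
bounded = bounded_total_tokens(steps=100, tokens_per_step=500, budget_steps=5)
print(f"naive: {naive:,} tokens, bounded: {bounded:,} tokens")
```

The naive total grows with the square of the number of steps, while the bounded total grows linearly, so the gap widens as tasks get longer. Whether a bounded context still answers correctly is a separate question this toy model does not address.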

The efficiency framing reflects a maturing of the agent ecosystem. The early phase celebrated what agents could do; the current phase is obsessed with what they cost to run reliably. Tokens per query is becoming a headline metric precisely because it maps directly to gross margin for anyone deploying agents at volume -- the same economic logic driving investment into inference chips and serving software from Groq to Baseten.
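A back-of-the-envelope sketch shows how tokens per query flow through to gross margin. Only the token counts come from the reporting; the price per million tokens and the revenue per query are hypothetical assumptions, so the resulting margins are illustrative rather than a claim about any real product.

```python
# Reported tokens per query (from the article).
TOKENS_PER_QUERY = {"LangMem": 3_260_000, "New framework": 118_000}

# Hypothetical assumptions, for illustration only.
ASSUMED_USD_PER_MILLION_TOKENS = 1.00
ASSUMED_REVENUE_PER_QUERY_USD = 0.50


def cost_per_query(tokens: int, usd_per_million: float) -> float:
    """Inference cost of one query at a flat per-token price."""
    return tokens / 1_000_000 * usd_per_million


def gross_margin(revenue: float, cost: float) -> float:
    """Gross margin as a fraction of revenue (negative means a loss per query)."""
    return (revenue - cost) / revenue


margins = {}
for name, tokens in TOKENS_PER_QUERY.items():
    cost = cost_per_query(tokens, ASSUMED_USD_PER_MILLION_TOKENS)
    margins[name] = gross_margin(ASSUMED_REVENUE_PER_QUERY_USD, cost)
    print(f"{name}: cost ${cost:.3f}/query, gross margin {margins[name]:.0%}")
```

Under these assumed prices, the larger token count loses money on every query while the smaller one leaves a healthy margin. Different assumptions shift the numbers, but the roughly 28x cost gap between the two stays fixed.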

Memory architecture is now a genuine competitive battleground. Frameworks like LangMem, along with the memory layers inside agent platforms from LangChain, LlamaIndex and a field of startups, are competing on how intelligently they store, compress and retrieve context. A framework claiming a roughly 28x efficiency advantage, if it holds up, is the kind of infrastructure improvement that quietly determines which agent products can actually make money.

The bear case is the usual caveat on benchmark claims: a single comparison may not generalize, token savings can trade off against answer quality or recall, and real-world workloads vary. What to watch: independent reproductions of the efficiency numbers across diverse tasks, whether the framework preserves answer quality at lower token counts, and how quickly efficient memory becomes standard in production agent stacks.

Originally reported by VentureBeat. Analysis and editorial commentary by Value Add Pulse.
