DeepSeek and Peking University have jointly open-sourced DSpark, a speculative-decoding framework that speeds up large language model inference by 60% to 85% per user and delivers up to a 661% throughput gain under strict latency constraints, with no hardware upgrades or model retraining required, according to VentureBeat. DSpark is released under the MIT license on GitHub, alongside DeepSpec, a general-purpose codebase for training custom draft models.
The technical approach attacks a known limitation of speculative decoding. Classic methods train a separate, smaller 'draft' model to propose tokens that the larger target model then verifies -- effective, but costly to build and maintain. DSpark instead grafts the speculative head directly onto the target model, reducing layer duplication, and pairs a 'semi-autoregressive generation' method with a 'confidence-scheduled verification' system. The deployed configuration, 'DSpark-5,' improves per-user generation speed by 60-85% on DeepSeek-V4-Flash and 57-78% on V4-Pro.
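To make the draft-and-verify idea concrete, here is a minimal sketch of the classic scheme with greedy decoding. It is not DSpark's implementation: it does not model the grafted speculative head, semi-autoregressive generation or confidence-scheduled verification. The two toy "models" (`target_next`, `draft_next`), the vocabulary size of 50 and the block size `k` are all hypothetical stand-ins chosen for illustration.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def target_next(context: List[int]) -> int:
    """Toy stand-in for the large target model (hypothetical, deterministic)."""
    return (sum(context) * 31 + len(context)) % 50


def draft_next(context: List[int]) -> int:
    """Toy stand-in for the cheap draft model: it agrees with the target
    except when the context length is a multiple of 4."""
    guess = target_next(context)
    return (guess + 1) % 50 if len(context) % 4 == 0 else guess


def greedy_decode(prompt: List[int], n_new: int, target: NextToken) -> List[int]:
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(
    prompt: List[int], n_new: int, k: int, target: NextToken, draft: NextToken
) -> Tuple[List[int], int]:
    """Greedy draft-and-verify decoding.

    Returns the generated sequence and the number of verification rounds.
    In a real system each round is a single batched forward pass of the
    target model; here the target is called per position for simplicity.
    """
    tokens = list(prompt)
    goal = len(prompt) + n_new
    rounds = 0
    while len(tokens) < goal:
        # 1. The draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))

        # 2. The target checks the proposal position by position.
        rounds += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(tokens + accepted)
            if tok == expected:
                accepted.append(tok)       # draft guessed right: keep it
            else:
                accepted.append(expected)  # first miss: take the target's token
                break
        else:
            # Every proposed token matched, so the target adds one more.
            accepted.append(target(tokens + accepted))

        tokens.extend(accepted)
    return tokens[:goal], rounds


if __name__ == "__main__":
    prompt = [1, 2, 3]
    baseline = greedy_decode(prompt, 20, target_next)
    fast, rounds = speculative_decode(prompt, 20, 4, target_next, draft_next)
    print("identical output:", fast == baseline)
    print("verification rounds:", rounds, "vs 20 sequential target steps")
```

Because every kept token is either confirmed or supplied by the target, this greedy variant reproduces the target-only output exactly. Any speedup comes from needing fewer sequential target passes, which depends on how often the draft guesses correctly.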
The significance is economic. Inference -- the perpetual cost of serving a model with every query -- is the dominant and fastest-growing line item in AI, the same dynamic driving billion-dollar bets on Baseten, Groq and Upscale AI. A free, open framework that wrings 60-85% more per-user generation speed out of existing hardware attacks that cost from the software side, reducing the pressure to buy ever more accelerators. For a Chinese ecosystem constrained by US export controls on the most advanced GPUs, squeezing more out of available compute is a strategic necessity, not just an optimization.
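As a rough guide to reading the headline percentages, the short sketch below converts a reported gain into a multiplier and, for per-user speed, into the share of generation time saved. It assumes that "X% faster" means tokens per second rise by X%, which is an interpretation rather than something the release spells out. The function names are illustrative.

```python
def speedup_factor(gain_percent: float) -> float:
    """A gain of X% corresponds to a multiplier of 1 + X/100."""
    return 1.0 + gain_percent / 100.0


def time_saved_percent(gain_percent: float) -> float:
    """Share of generation time saved if speed rises by gain_percent."""
    return (1.0 - 1.0 / speedup_factor(gain_percent)) * 100.0


if __name__ == "__main__":
    # Reported per-user speed gains on DeepSeek-V4-Flash.
    for gain in (60, 85):
        print(f"{gain}% faster = {speedup_factor(gain):.2f}x, "
              f"{time_saved_percent(gain):.1f}% less time per token")
    # Reported peak throughput gain under strict latency constraints.
    print(f"661% more throughput = {speedup_factor(661):.2f}x")
```

Under that assumption, a 60-85% speed gain shortens the time a user waits per token by roughly 37.5% to 46%, while a 661% throughput gain corresponds to about 7.6 times the baseline throughput.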
The timing fits a pattern. DSpark landed within 48 hours of Meituan open-sourcing the 1.6-trillion-parameter LongCat-2.0 -- two MIT-licensed releases from Chinese players in the same window, both aimed at efficiency and openness. Crucially, DeepSpec is model-agnostic, with configurations supporting Alibaba's Qwen and Google's Gemma, so DSpark's gains can spread well beyond DeepSeek's own models and into the broader open-source community.
The bear case is that vendor-reported speedups need independent verification, real-world gains vary by workload, and speculative decoding can trade accuracy for speed if poorly tuned. Western enterprises may also hesitate to build inference infrastructure around Chinese-origin frameworks regardless of license. What to watch: independent benchmarks of DSpark across model families, how quickly the open-source community adopts DeepSpec, and whether efficiency breakthroughs like this meaningfully blunt the impact of US chip restrictions.