Alibaba researchers introduced SkillWeaver, a framework designed to solve one of the more persistent problems in enterprise AI agent deployment: routing complex, multi-step tasks to the correct tools from a library that can contain hundreds or thousands of options, according to VentureBeat reporting published July 2, 2026. The framework's headline result is a token-consumption reduction of more than 99% compared to the naive approach of exposing an agent to an entire tool library at once.
The underlying problem is straightforward but costly at scale. As enterprise agents integrate with massive tool ecosystems like the Model Context Protocol (MCP), accurately routing a query to the right tool becomes difficult. Simply feeding an LLM an entire tool library and letting it figure out the right one is highly inefficient: it quickly overwhelms context limits and consumes hundreds of thousands of tokens per query. Most existing tool-use frameworks treat this as a single-skill selection problem, which breaks down for real-world business requests like "download the dataset, transform it, and create visual reports" that require sequencing multiple distinct tools into a cohesive plan.
SkillWeaver addresses this through three stages (Decompose, Retrieve, and Compose) plus a technique the researchers call Iterative Skill-Aware Decomposition (SAD). An LLM first breaks a complex query into a sequence of atomic sub-tasks; an embedding-based retriever then pulls a shortlist of the best-matching tools for each sub-task from the library; and a final planning stage checks for compatibility between tools and assembles an executable plan as a directed acyclic graph, allowing independent steps to run in parallel. SAD's feedback loop is the key innovation: rather than a one-shot decomposition, the LLM drafts an initial plan, runs a preliminary search, and then uses the retrieved (often more technically precise) tool names to rewrite its decomposition so its vocabulary matches what's actually available in the library.
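The three stages can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the researchers' implementation: the four-tool library, its descriptions, and the sub-task dependencies are invented for the example; the decomposition is hard-coded where the described system would call an LLM; and a bag-of-words cosine similarity stands in for a real embedding retriever. The SAD feedback loop is omitted.

```python
import math
from collections import Counter
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical four-tool library; names and descriptions are invented.
TOOLS = {
    "http_download_file": "download a file or dataset from a url",
    "dataframe_transform": "transform clean and reshape a tabular dataset",
    "chart_render_report": "create visual charts and reports from data",
    "sql_run_query": "run a sql query against a database",
}

def embed(text):
    """Bag-of-words vector; a toy stand-in for a sentence-embedding model."""
    return Counter(text.lower().replace("_", " ").split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def retrieve(sub_task, k=2):
    """Retrieve: shortlist the k tools that best match one sub-task."""
    query = embed(sub_task)
    ranked = sorted(
        TOOLS,
        key=lambda name: cosine(query, embed(name + " " + TOOLS[name])),
        reverse=True,
    )
    return ranked[:k]

def compose(dependencies):
    """Compose: group steps into waves; steps within a wave can run in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves

# Decompose: hard-coded here. Each sub-task maps to the sub-tasks it depends on
# (an assumed dependency structure for illustration).
dependencies = {
    "download the dataset": set(),
    "run a sql query": set(),
    "transform the dataset": {"download the dataset", "run a sql query"},
    "create visual reports": {"transform the dataset"},
}

plan = {task: retrieve(task)[0] for task in dependencies}
waves = compose(dependencies)
for wave in waves:
    print([(task, plan[task]) for task in wave])
```

In this toy plan the download and the SQL query share the first wave because neither depends on the other, which is the kind of parallelism a directed acyclic graph makes explicit.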
To evaluate the approach, the researchers built a custom benchmark, CompSkillBench, consisting of 300 multi-step queries against a library of 2,209 real-world tools sourced from the public MCP ecosystem across 24 functional categories including cloud infrastructure, finance and databases. Using a lightweight 7-billion-parameter model (Qwen2.5-7B-Instruct) for decomposition, the SAD feedback loop lifted decomposition accuracy from 51.0% in a vanilla setup to 67.7%; with the larger Qwen-Max model, accuracy reached 92%. On the hardest tasks requiring four to five distinct skills, SAD improved accuracy by 50%.
One of the more counterintuitive findings: larger models can actually perform worse than smaller ones when unguided, because they tend to over-decompose tasks into unnecessarily granular steps. In the vanilla setup, a 14-billion-parameter model's accuracy fell below the 7B model's, until SAD's retrieved tool hints anchored it back to the tools actually available. A brute-force baseline that stuffed all tool names directly into a large model's prompt retrieved the correct tool category only 21.1% of the time despite near-perfect task-breakdown capability, while a traditional ReAct-style agent loop achieved 0% decomposition accuracy on the benchmark, collapsing multi-step plans into isolated actions.
The practical token savings are the headline number for enterprise adoption: SkillWeaver's targeted retrieve-and-route approach reduced estimated context consumption from roughly 884,000 tokens to about 1,160 tokens per query, a reduction of roughly 99.9% that translates directly into lower API costs and faster response times for any team running agents against large, real-world tool libraries. The researchers have not yet released source code, but they shared prompt templates and relied on off-the-shelf, easily reproducible components (a MiniLM-based embedding retriever with a FAISS index), meaning other teams can realistically implement the approach themselves using standard orchestration libraries.
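The reduction can be checked directly from the two reported token counts. The short calculation below uses only those rounded figures, so the result is approximate; the variable names are illustrative.

```python
# Reported per-query context estimates (rounded figures from the article).
full_library_tokens = 884_000   # entire tool library placed in the prompt
skillweaver_tokens = 1_160      # targeted retrieve-and-route approach

reduction = 1 - skillweaver_tokens / full_library_tokens
ratio = full_library_tokens / skillweaver_tokens

print(f"reduction: {reduction:.2%}")     # reduction: 99.87%
print(f"context shrinks ~{ratio:.0f}x")  # context shrinks ~762x
```

The computed 99.87% is consistent with both the "more than 99%" headline and the rounded 99.9% figure.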
For founders and engineering teams building enterprise AI agents, SkillWeaver's core lesson is that task-decomposition granularity, not raw model size, is the actual bottleneck in tool-routing accuracy, a finding with direct cost implications for any company running agents against MCP-scale tool libraries. For investors in agent infrastructure and orchestration tooling, this is a reminder that meaningful efficiency gains in agentic AI are increasingly coming from smarter retrieval and planning architecture rather than simply waiting for cheaper, faster frontier models.
What to watch: whether Alibaba releases SkillWeaver's source code for broader adoption and benchmarking, how the approach performs on tool libraries larger than the 2,209 tested here, and whether other labs or enterprise AI vendors adopt similar skill-aware decomposition techniques as agentic tool libraries continue to grow in scale.