Tool-R0: A Self-Play RL Framework for Training Tool-Using AI Agents with Zero Data

The arXiv paper Tool-R0 reports a 92.5% average relative improvement in LLM tool calling via Self-Play RL alone, with no training data. We analyze its Generator-Solver co-evolution and its practical implications.

The core capability of an AI agent is “the ability to accurately invoke external tools.” Calling APIs, querying databases, executing code — without these abilities, an agent is nothing more than a simple chatbot. Yet training this tool-calling capability has traditionally required tens or even hundreds of thousands of labeled data samples.

Tool-R0 (Acikgoz et al., arXiv 2602.21320), published in February 2026, upends this assumption. Starting from zero training data, it trains a tool-calling agent from scratch using only Self-Play reinforcement learning, and it surpasses conventional supervised learning approaches in performance.

Why This Paper Matters Right Now

The AI agent market is experiencing rapid growth centered on tool-calling (Function Calling / Tool Use) capabilities. OpenAI’s Function Calling, Anthropic’s Tool Use, Google’s Gemini Function Calling — all frontier models ship with this capability as a core feature.

However, equipping open-source or domain-specific models with this capability has required expensive training data construction:

  • xLAM dataset: 60,000 tool-calling examples
  • Hammer dataset: 210,000 examples
  • ToolACE dataset: 12,000 examples

These datasets must be rebuilt every time the domain changes, and customizing them for internal enterprise APIs is even more difficult. Tool-R0 eliminates this bottleneck entirely through Self-Play RL.

The Core Idea of Tool-R0: Generator-Solver Co-Evolution

Tool-R0’s architecture is remarkably elegant. Two independent agents are initialized from a single base LLM:

graph TD
    subgraph "Tool-R0 Self-Play Cycle"
        G["Generator πθ<br/>Task Creator"] -->|Generates challenging tasks| D["Task Pool<br/>10,000 items"]
        D -->|Curriculum-based selection| S["Solver πϕ<br/>Task Solver"]
        S -->|Success rate feedback| G
    end
    G -->|"Self-evolves via<br/>GRPO rewards"| G
    S -->|"Self-evolves via<br/>GRPO rewards"| S

The Generator (πθ) creates tool-calling tasks. Specifically, it produces triplets of (user query, tool menu, ground-truth tool call).

The Solver (πϕ) learns to predict the correct tool call from a given query and tool list.
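To make the triplet format concrete, here is a hypothetical example of a single generated task. The field names are illustrative, not the paper's exact schema.

# A hypothetical task triplet as the Generator might emit it.
# Field names are illustrative, not the paper's exact schema.
task = {
    "query": "What's the weather in Seoul tomorrow, in celsius?",
    "tools": [
        {"name": "get_weather",
         "parameters": {"location": "string", "date": "string", "unit": "string"}},
        {"name": "get_news",  # distractor tool in the menu
         "parameters": {"topic": "string"}},
    ],
    "ground_truth": {
        "name": "get_weather",
        "arguments": {"location": "Seoul", "date": "tomorrow", "unit": "celsius"},
    },
}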

The key is that they are connected through complementary reward signals:

  • The Generator receives a high reward when it creates problems that are moderately challenging for the Solver
  • The Solver receives a high reward when it executes accurate tool calls

As this interaction repeats, the Generator creates increasingly sophisticated problems while the Solver learns to tackle progressively harder tasks — all without any data.

The Sophistication of Reward Design

The reason Tool-R0 achieves such strong performance lies in its reward function design.

Generator Reward: Three-Stage Quality Control

| Reward Component | Role | Description |
|---|---|---|
| Format Reward (r_fmt) | Structural compliance | XML tag and JSON parsing validation |
| Validity Reward (r_valid) | Internal consistency | Ground-truth tool exists in menu, required parameters included, argument values grounded in query |
| Curriculum Reward (r_curr) | Difficulty calibration | Targets Solver success rate p̂_succ within [0.25, 0.75] |

The Curriculum Reward is particularly critical. The highest reward is given when generating problems where the Solver’s success rate falls between 25% and 75%. Problems that are too easy (success rate > 75%) or too hard (success rate < 25%) do not contribute meaningfully to learning. This aligns precisely with the pedagogical concept of the “Zone of Proximal Development (ZPD).”
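A minimal sketch of how the three components might combine, assuming an equal-weight average with a hard gate on format validity. Only the [0.25, 0.75] curriculum band is taken from the paper; the weighting and the concrete checks are illustrative.

def generator_reward(task: dict, solver_success_rate: float) -> float:
    # r_fmt: structural compliance. Here we only check that the task
    # parsed into the expected keys; the paper also validates XML tags
    # and JSON syntax.
    if not {"query", "tools", "ground_truth"} <= task.keys():
        return 0.0  # malformed output earns nothing
    r_fmt = 1.0

    # r_valid: the ground-truth tool must exist in the menu, and every
    # argument value must be grounded in the query text.
    gt = task["ground_truth"]
    in_menu = gt["name"] in {t["name"] for t in task["tools"]}
    grounded = all(str(v).lower() in task["query"].lower()
                   for v in gt["arguments"].values())
    r_valid = 1.0 if in_menu and grounded else 0.0

    # r_curr: full reward only when the Solver's estimated success rate
    # on this kind of task falls inside the ZPD band.
    r_curr = 1.0 if 0.25 <= solver_success_rate <= 0.75 else 0.0

    # Equal weighting is an assumption, not the paper's formula.
    return (r_fmt + r_valid + r_curr) / 3.0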

Solver Reward: Fine-Grained Accuracy Measurement

The Solver’s accuracy reward is not a simple correct/incorrect binary judgment but is decomposed along three dimensions:

  1. Tool name matching (binary): Was the correct tool selected?
  2. Key overlap (F1 score): Were the right parameter names supplied, with none missing or extraneous?
  3. Value matching (flexible comparison): Are the argument values accurate?

A multiplicative penalty is applied when extraneous tool calls are generated. This fine-grained reward enables partial credit, providing meaningful gradients even during the early stages of training.
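A sketch of this decomposition, assuming the components combine multiplicatively so that a wrong tool name zeroes the reward; the 0.5 penalty factor for extraneous calls is likewise an assumption.

def solver_reward(pred: dict, gold: dict, n_extra_calls: int = 0) -> float:
    # 1. Tool name matching: a binary gate.
    if pred["name"] != gold["name"]:
        return 0.0

    # 2. Key overlap: F1 over argument names, so missing and spurious
    #    parameters both cost reward.
    p, g = set(pred["arguments"]), set(gold["arguments"])
    tp = len(p & g)
    if p or g:
        prec = tp / len(p) if p else 0.0
        rec = tp / len(g) if g else 0.0
        key_f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    else:
        key_f1 = 1.0  # both calls take no arguments

    # 3. Value matching: flexible, case-insensitive comparison over the
    #    keys the two calls share.
    shared = p & g
    value_acc = (sum(str(pred["arguments"][k]).lower() ==
                     str(gold["arguments"][k]).lower() for k in shared)
                 / len(shared)) if shared else 1.0

    # Multiplicative penalty for each extraneous tool call.
    return key_f1 * value_acc * 0.5 ** n_extra_calls

A prediction that picks the right tool but misses one of three arguments still earns partial credit, which is what keeps the gradient signal informative early in training.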

Training Pipeline: The Power of Three Iterations

The entire training process consists of just three iterations:

graph TD
    subgraph "Each Iteration (3 total)"
        A["1. Generator Training<br/>2,000 samples / 50 steps"] --> B["2. Task Synthesis<br/>10,000 candidates generated"]
        B --> C["3. Data Curation<br/>Deduplication + cross-validation<br/>+ difficulty-based sorting"]
        C --> D["4. Solver Training<br/>2,000 selected / 50 steps"]
        D --> E["5. Feedback Loop<br/>Solver performance → Generator condition update"]
    end
    E -->|Next Iteration| A

Remarkably, each iteration uses only 2,000 self-generated training samples. This stands in stark contrast to conventional supervised learning approaches, which require tens or hundreds of thousands of examples.
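Put together, one run might look like the following Python-flavored sketch. Every function named here (load_base_model, train_grpo, synthesize_tasks, curate, estimate_success_rates) is a placeholder standing in for machinery the paper describes, not a real API.

# Structural sketch of the pipeline; all functions are placeholders.
generator = load_base_model("Qwen2.5-1.5B")  # the Generator (πθ)
solver = load_base_model("Qwen2.5-1.5B")     # the Solver (πϕ), a separate copy

for iteration in range(3):
    # 1. Generator training: GRPO with the three-part reward above.
    generator = train_grpo(generator, reward_fn=generator_reward,
                           num_samples=2_000, steps=50)

    # 2. Task synthesis: sample a large candidate pool.
    candidates = synthesize_tasks(generator, n=10_000)

    # 3. Curation: deduplicate, cross-validate, sort by estimated
    #    difficulty, then keep 2,000 tasks for this round.
    tasks = curate(candidates, solver, keep=2_000)

    # 4. Solver training: GRPO with the fine-grained accuracy reward.
    solver = train_grpo(solver, reward_fn=solver_reward,
                        data=tasks, steps=50)

    # 5. Feedback: the Solver's fresh success rates condition the
    #    Generator's curriculum reward in the next iteration.
    solver_success_rates = estimate_success_rates(solver, tasks)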

Benchmark Results: Outperforming Supervised Learning

Key Results on Qwen2.5-1.5B

| Benchmark | Baseline | Tool-R0 | Relative Improvement |
|---|---|---|---|
| ToolAlpaca | 35.96% | 47.36% | +31.7% |
| SealTools | 47.27% | 83.00% | +75.6% |
| NexusRaven | 17.61% | 34.59% | +96.4% |
| API-Bank | 19.13% | 50.62% | +164.6% |
| SNIPS | 4.29% | 20.86% | +386.3% |
| Average | 24.85% | 47.84% | +92.5% |

The dramatic improvements on API-Bank and SNIPS are particularly noteworthy. These benchmarks simulate real-world API call scenarios, making it remarkable that a zero-data approach can achieve this level of performance.

Comparison with Supervised Learning Datasets

The most impressive result is that Tool-R0 outperforms models trained on actual labeled data:

| Training Method | Data Size | Average Accuracy |
|---|---|---|
| xLAM dataset | 60,000 samples | 43.60% |
| Hammer dataset | 210,000 samples | 43.74% |
| ToolACE dataset | 12,000 samples | 44.71% |
| ToolRL dataset | 4,000 samples | 46.06% |
| Tool-R0 (zero data) | 0 samples | 47.84% |

Tool-R0, trained with no data, outperforms Hammer — which uses 210,000 training samples — by more than 4 percentage points.

Validation Across Multiple Models

Tool-R0 is not tied to any specific model:

| Model | Baseline | Tool-R0 | Improvement |
|---|---|---|---|
| Qwen2.5-0.5B | 15.47% | 30.57% | +101.0% |
| Qwen2.5-1.5B | 24.85% | 47.84% | +92.5% |
| Qwen2.5-3B | 43.97% | 48.50% | +10.3% |
| Llama-3.2-3B | 36.12% | 40.47% | +12.0% |

It achieves over 2x improvement on small models (0.5B) and over 10% improvement even on larger models (3B). While the magnitude of improvement decreases for larger models that already possess some tool-calling ability, consistent gains are observed across the board.

Key Finding: Why Parameter Separation Matters

The most important finding from the ablation study is that Generator and Solver parameters must be kept separate:

| Configuration | Accuracy | Performance Drop |
|---|---|---|
| Full Tool-R0 (separated) | 47.84% | — |
| Shared weights | 30.42% | −36.4% |
| Frozen Generator | 41.65% | −12.9% |
| No difficulty reward | 43.54% | −9.0% |

Using shared weights causes a 36.4% performance drop. The research team attributes this to “gradient interference” — when the conflicting objectives of exploration (Generator) and execution (Solver) are optimized within the same parameter space, they undermine each other.
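Concretely, the separation amounts to instantiating two copies of the base checkpoint with independent optimizers, for example with Hugging Face transformers. This is a sketch: the checkpoint name, learning rate, and training stack are illustrative, not the paper's exact setup.

import torch
from transformers import AutoModelForCausalLM

# Two independent copies of the same base checkpoint. Sharing one
# set of weights is what produced the 36.4% drop in the ablation.
BASE = "Qwen/Qwen2.5-1.5B-Instruct"
generator = AutoModelForCausalLM.from_pretrained(BASE)
solver = AutoModelForCausalLM.from_pretrained(BASE)

# Independent optimizers: Generator and Solver gradients never touch
# the same parameters, so the exploration objective cannot interfere
# with the execution objective.
gen_opt = torch.optim.AdamW(generator.parameters(), lr=1e-6)
sol_opt = torch.optim.AdamW(solver.parameters(), lr=1e-6)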

This also suggests an organizational analogy: separating the team that defines problems from the team that solves them, while connecting the two through tight feedback loops, can work better than merging the roles.

Practical Implications for EMs and CTOs

1. Reducing the Cost of Building Enterprise API Tool-Calling Agents

In traditional approaches, the single largest cost was training data construction. Building tens of thousands of tool-calling examples tailored to internal enterprise APIs could take months of work. Tool-R0 eliminates this step entirely.

graph TD
    subgraph "Traditional Approach"
        A1["API documentation analysis<br/>2–4 weeks"] --> B1["Training data construction<br/>4–8 weeks"]
        B1 --> C1["Model training<br/>1–2 weeks"]
        C1 --> D1["Evaluation and tuning<br/>2–4 weeks"]
    end
    subgraph "Tool-R0 Approach"
        A2["API schema definition<br/>1–2 days"] --> B2["Self-Play RL execution<br/>1–3 days"]
        B2 --> C2["Evaluation and deployment<br/>1–2 days"]
    end

2. Reassessing Small Models

Tool-R0 achieves 2x performance improvement even on a 0.5B model. This means that viable tool-calling agents can be built for edge devices and cost-sensitive environments. This is particularly significant for startups with limited GPU budgets or private cloud environments.

3. Automating Curriculum Learning

The most impressive aspect is that the learning curriculum is generated automatically. Previously, humans had to manually order training examples from easy to hard, but Tool-R0's Generator automatically detects the Solver's current skill level and generates problems at the appropriate difficulty.

This opens the door to autonomously operating the training pipeline for AI systems.

The Bigger Picture: Self-Evolving Agents

Tool-R0 is part of the broader "Self-Evolving Agent" paradigm emerging in 2026 AI agent research:

  • EvolveR (ICLR 2026 under review): Experience-based lifecycle for agent self-improvement
  • Agent0: Building agents from zero data through tool-integrated reasoning
  • EvoAgentX (open source on GitHub): Self-evolving agent ecosystem
  • ICLR 2026 Workshop: “Lifelong Agents: Learning, Aligning, Evolving”

The common message across these works is clear: an era is dawning in which agents generate their own training data and evolve on their own, without relying on human-created data.

Conclusion

Tool-R0 is an important study that demonstrates “you can build powerful AI agents without any data.” The key takeaways are:

  1. Self-Play RL alone can outperform supervised learning (92.5% improvement, outperforming a 210K-sample dataset)
  2. Generator-Solver separation is essential (36.4% performance drop when shared)
  3. Automatic curriculum generation is the key to training efficiency (ZPD range [0.25, 0.75])
  4. Effective even on small models (2x improvement on 0.5B)

The most important implication for EMs and CTOs is that a methodology has emerged that can bypass the biggest bottleneck — training data construction — when building AI agents for internal enterprise APIs. Production-level validation is still needed, but this direction is likely to become a significant turning point in AI agent development in 2026.
