FunctionGemma 270M — Achieving 90-97% Multi-Turn Tool Calling Accuracy with an Ultra-Small Model

Analysis of how fine-tuning FunctionGemma 270M improved multi-turn tool calling accuracy from 10-39% to 90-97%, matching a 120B teacher model. More evidence that scaling isn't everything.

Overview

Google’s FunctionGemma 270M is a 270M-parameter model purpose-built for function calling. It’s lightweight enough to run at 125 tok/s on a smartphone CPU, but its base multi-turn tool calling accuracy was only 10-39%.

The Distil Labs team fine-tuned this model using knowledge distillation from a 120B teacher, achieving 90-97% accuracy and matching or exceeding the teacher on two of three tasks despite being roughly 445× smaller.

This is compelling additional evidence challenging the assumption that scaling is the only path to performance.

Why Multi-Turn Is Hard

Single-turn function calling is relatively straightforward. Multi-turn introduces compounding challenges:

  • Conversation history tracking: Must remember previous function call results
  • Intent change handling: Users may shift intent mid-conversation
  • Cumulative errors: 80% single-turn accuracy drops to 33% over 5 turns (0.8⁵)
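To make the history-tracking requirement concrete, here is a minimal sketch of the bookkeeping a multi-turn tool-calling loop has to do. The message format and the `set_light` tool are illustrative assumptions, not FunctionGemma's actual schema.

```python
# Minimal multi-turn tool-call bookkeeping sketch.
# The message dict format and tool names here are hypothetical.

def set_light(room: str, on: bool) -> dict:
    """Stub smart-home tool standing in for a real device API."""
    return {"room": room, "state": "on" if on else "off"}

TOOLS = {"set_light": set_light}

def execute_turn(history: list, tool_call: dict) -> list:
    """Run one model-proposed tool call and append its result to the
    conversation history, so later turns can condition on it (the
    'conversation history tracking' requirement above)."""
    result = TOOLS[tool_call["name"]](**tool_call["arguments"])
    history.append({"role": "tool",
                    "name": tool_call["name"],
                    "content": result})
    return history

history = [{"role": "user", "content": "Turn on the kitchen light"}]
history = execute_turn(history, {"name": "set_light",
                                 "arguments": {"room": "kitchen", "on": True}})
```

Every turn grows the context the model must reason over, which is exactly where small base models start to drift.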

Base FunctionGemma’s projected 5-turn accuracy is effectively unusable:

| Task | Single-Turn | 5-Turn Projected |
|------|-------------|------------------|
| Smart Home Control | 38.8% | ~0.9% |
| Banking Voice Assistant | 23.4% | ~0.07% |
| Shell Command Execution | 9.9% | ~0.001% |
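The projections above are simply single-turn accuracy compounded over independent turns, p ** n. A quick check reproduces both the 0.8⁵ ≈ 33% example and the projected figures:

```python
# 5-turn projection = single-turn accuracy compounded over 5 independent turns.
def projected_accuracy(p_single: float, turns: int = 5) -> float:
    return p_single ** turns

# The 80% -> ~33% example quoted above:
print(f"{projected_accuracy(0.80):.1%}")   # 32.8%

for task, p in [("Smart Home Control", 0.388),
                ("Banking Voice Assistant", 0.234),
                ("Shell Command Execution", 0.099)]:
    print(f"{task}: {projected_accuracy(p):.4%}")
```

The independence assumption is a simplification (real errors can correlate across turns), but it captures why multi-turn accuracy collapses so quickly.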

Fine-Tuning Results

Distil Labs performed knowledge distillation from a 120B GPT-oss teacher model. The results were remarkable:

```mermaid
graph LR
    A[Base FunctionGemma<br/>10-39%] -->|Fine-tuning| B[Tuned FunctionGemma<br/>90-97%]
    C[120B Teacher<br/>92-97%] -.->|Knowledge Distillation| B
    style A fill:#ff6b6b,color:#fff
    style B fill:#51cf66,color:#fff
    style C fill:#339af0,color:#fff
```
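The post doesn't publish Distil Labs' exact training recipe, so as a generic illustration of the knowledge-distillation objective, here is a minimal sketch: the student is trained to match the teacher's temperature-softened output distribution (Hinton-style distillation; the function names and temperature value are assumptions).

```python
import math

def softmax(logits, T=1.0):
    """Temperature-softened probability distribution over one token's vocab."""
    exps = [math.exp(l / T) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions.
    The T*T factor is the usual gradient-scale correction."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q)) * T * T
```

When the student exactly matches the teacher's logits the loss is zero; the further its distribution drifts from the teacher's, the larger the penalty.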

Detailed Results by Task

| Task | Base | Tuned | Teacher (120B) |
|------|------|-------|----------------|
| Smart Home Control | 38.8% | 96.7% | 92.1% |
| Banking Voice Assistant | 23.4% | 90.9% | 97.0% |
| Shell Command Execution | 9.9% | 96.0% | 97.0% |

The tuned model beat the 120B teacher on smart home control and shell commands. Only the banking task fell short — the most complex task with 14 functions and ASR noise in the input.

Key Insights

1. Data Quality > Model Size

The same high-quality dataset produced strong results on both Qwen3-0.6B and FunctionGemma 270M. The key factor is task-specific, high-quality training data, not model size.

2. Practical Implications of a 445× Smaller Model

| Metric | 120B Teacher | 270M Tuned |
|--------|--------------|------------|
| Parameters | 120,000M | 270M |
| Quantized Size | ~60GB+ | ~288MB |
| Runtime | GPU Server | Smartphone CPU |
| Inference Speed | – | 125 tok/s |

This enables production-ready tool calling without GPUs — on edge devices, mobile apps, and in-browser inference.
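The headline numbers can be sanity-checked with quick arithmetic. Assuming ~1 byte per parameter after 8-bit quantization is my simplification; the cited ~288MB presumably includes file-format overhead or higher-precision layers.

```python
# Back-of-envelope check of the comparison table above.
teacher_params = 120_000e6   # 120B
student_params = 270e6       # 270M

ratio = teacher_params / student_params
print(round(ratio))          # 444 -- the "~445x smaller" claim

# Assuming ~1 byte per parameter after 8-bit quantization:
size_mib = student_params / 2**20
print(f"~{size_mib:.0f} MiB")  # ~257 MiB, in the ballpark of the cited ~288MB
```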

3. A Counter-Argument to Scaling Laws

Combined with the recent rise of open-source models like DeepSeek and Qwen, these results provide additional evidence against the assumption that increasing parameters is the only path to better performance. Proper fine-tuning on specialized tasks can overcome model size limitations.

Open-Source Resources

All models and datasets are publicly available for reproduction.

Conclusion

The FunctionGemma 270M fine-tuning case sends an important message to the AI industry: a 270M model matching, and on two of three tasks beating, a 120B model shows that not every problem requires a massive model.

As demand for tool calling grows in constrained environments — edge AI, mobile deployment, IoT devices — the potential of ultra-small specialized models will only become more significant.


About the Author

Kim Jangwook

Full-Stack Developer specializing in AI/LLM

Building AI agent systems, LLM applications, and automation solutions with 10+ years of web development experience. Sharing practical insights on Claude Code, MCP, and RAG systems.