FunctionGemma 270M — Achieving 90-97% Multi-Turn Tool Calling Accuracy with an Ultra-Small Model
Analysis of how fine-tuning FunctionGemma 270M improved multi-turn tool calling accuracy from 10-39% to 90-97%, matching a 120B teacher model. More evidence that scaling isn't everything.
Overview
Google’s FunctionGemma 270M is a 270M-parameter model purpose-built for function calling. It’s lightweight enough to run at 125 tok/s on a smartphone CPU, but its base multi-turn tool calling accuracy was only 10-39%.
The Distil Labs team fine-tuned this model using knowledge distillation from a 120B teacher, achieving 90-97% accuracy — matching or exceeding the teacher despite being 445× smaller.
This is compelling additional evidence challenging the assumption that scaling is the only path to performance.
Why Multi-Turn Is Hard
Single-turn function calling is relatively straightforward. Multi-turn introduces compounding challenges:
- Conversation history tracking: Must remember previous function call results
- Intent change handling: Users may shift intent mid-conversation
- Cumulative errors: per-turn mistakes compound; 80% single-turn accuracy falls to roughly 33% over 5 turns (0.8⁵ ≈ 0.33)
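To make the history-tracking challenge concrete, here is a minimal sketch of a multi-turn tool-dispatch loop. The tool names, signatures, and chat-format fields are illustrative assumptions, not FunctionGemma's actual tool schema:

```python
import json

# Hypothetical tool registry -- names and signatures are illustrative,
# not FunctionGemma's actual tool schema.
TOOLS = {
    "set_light": lambda room, on: f"light in {room} {'on' if on else 'off'}",
    "get_temperature": lambda room: 21.5,
}

def run_turn(model_output: str, history: list) -> list:
    """Parse one model turn (a JSON tool call), dispatch it, and append
    both the call and its result to the conversation history."""
    call = json.loads(model_output)
    result = TOOLS[call["name"]](**call["arguments"])
    history.append({"role": "assistant", "tool_call": call})
    history.append({"role": "tool", "content": str(result)})
    return history

# Two turns: the second call must be grounded in the accumulated history,
# which is exactly where small base models tend to lose track.
history: list = []
history = run_turn('{"name": "set_light", "arguments": {"room": "kitchen", "on": true}}', history)
history = run_turn('{"name": "get_temperature", "arguments": {"room": "kitchen"}}', history)
print(len(history))  # 4 entries: two calls, two results
```

Every turn appends state the model must condition on; any turn that misparses the history can derail all subsequent calls, which is why errors compound.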
Base FunctionGemma’s projected 5-turn accuracy is effectively unusable:
| Task | Single-Turn | 5-Turn Projected |
|---|---|---|
| Smart Home Control | 38.8% | ~0.9% |
| Banking Voice Assistant | 23.4% | ~0.07% |
| Shell Command Execution | 9.9% | ~0.001% |
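The projections above follow directly from compounding the single-turn accuracy over five independent turns; a quick check of the table's numbers:

```python
def multi_turn_accuracy(single_turn: float, turns: int = 5) -> float:
    """Probability that every turn succeeds, assuming independent
    per-turn success rates (the simplifying assumption behind the table)."""
    return single_turn ** turns

for task, acc in [("Smart Home", 0.388), ("Banking", 0.234), ("Shell", 0.099)]:
    print(f"{task}: {multi_turn_accuracy(acc):.3%}")
```

Independence between turns is itself a simplification; in practice earlier errors can make later turns even harder, so these projections are, if anything, optimistic.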
Fine-Tuning Results
Distil Labs performed knowledge distillation from a 120B-parameter gpt-oss teacher model. The results were remarkable:
```mermaid
graph LR
    A[Base FunctionGemma<br/>10-39%] -->|Fine-tuning| B[Tuned FunctionGemma<br/>90-97%]
    C[120B Teacher<br/>92-97%] -.->|Knowledge Distillation| B
    style A fill:#ff6b6b,color:#fff
    style B fill:#51cf66,color:#fff
    style C fill:#339af0,color:#fff
```
Detailed Results by Task
| Task | Base | Tuned | Teacher (120B) |
|---|---|---|---|
| Smart Home Control | 38.8% | 96.7% | 92.1% |
| Banking Voice Assistant | 23.4% | 90.9% | 97.0% |
| Shell Command Execution | 9.9% | 96.0% | 97.0% |
The tuned model beat the 120B teacher on smart home control (96.7% vs 92.1%) and came within a point of it on shell commands. Only the banking task fell clearly short of the teacher — the most complex of the three, with 14 functions and ASR (speech-recognition) noise in the input.
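The exact distillation pipeline isn't detailed here, but a common pattern it likely resembles is generating training traces with the teacher and fine-tuning the student on them. A minimal sketch, where `teacher_generate` and the chat-format fields are hypothetical stand-ins, not Distil Labs' actual schema:

```python
def build_distillation_example(user_prompt: str, teacher_generate) -> dict:
    """Turn one teacher completion into a supervised fine-tuning example.
    `teacher_generate` stands in for a call to the 120B teacher model."""
    teacher_output = teacher_generate(user_prompt)
    return {
        "messages": [
            {"role": "user", "content": user_prompt},
            {"role": "assistant", "content": teacher_output},
        ]
    }

# Stub teacher for demonstration -- in practice this would be the 120B model
# emitting a tool call for the student to imitate.
stub_teacher = lambda p: '{"name": "set_light", "arguments": {"room": "kitchen", "on": true}}'
example = build_distillation_example("Turn on the kitchen light", stub_teacher)
```

The appeal of this recipe is that the student never needs to discover correct tool-calling behavior on its own; it only needs to imitate the teacher's already-correct traces on a narrow task distribution.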
Key Insights
1. Data Quality > Model Size
The same high-quality dataset produced strong results on both Qwen3-0.6B and FunctionGemma 270M. The key factor is task-specific, high-quality training data, not model size.
2. Practical Implications of a 445× Smaller Model
| Metric | 120B Teacher | 270M Tuned |
|---|---|---|
| Parameters | 120,000M | 270M |
| Quantized Size | ~60GB+ | ~288MB |
| Runtime | GPU Server | Smartphone CPU |
| Inference Speed | - | 125 tok/s |
This enables production-ready tool calling without GPUs — on edge devices, mobile apps, and in-browser inference.
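The size gap in the table is simple arithmetic; a rough estimate, assuming ~8-bit quantization for the student, ~4-bit for the teacher, and a small metadata overhead (the 5% figure is an assumption):

```python
def quantized_size_mb(params_millions: float, bits_per_param: float = 8.0,
                      overhead: float = 1.05) -> float:
    """Rough on-disk model size in MB at a given quantization width.
    The ~5% overhead for embeddings/metadata is an assumed fudge factor."""
    return params_millions * 1e6 * (bits_per_param / 8) * overhead / 1e6

print(f"270M at 8-bit: ~{quantized_size_mb(270):.0f} MB")            # in the ballpark of ~288MB
print(f"120B at 4-bit: ~{quantized_size_mb(120_000, 4) / 1024:.0f} GB")  # roughly the ~60GB+ figure
```

The ~300MB footprint is what makes smartphone-CPU deployment plausible at all; the teacher's ~60GB weights rule out anything but server GPUs.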
3. A Counter-Argument to Scaling Laws
Combined with the recent rise of open-source models like DeepSeek and Qwen, these results provide additional evidence against the assumption that increasing parameters is the only path to better performance. Proper fine-tuning on specialized tasks can overcome model size limitations.
Open-Source Resources
All models and datasets are publicly available for reproduction:
- Smart Home Model: distil-labs/distil-home-assistant-functiongemma
- Smart Home Data: distil-labs/distil-smart-home
- Banking Assistant Data: distil-labs/distil-voice-assistant-banking
- Shell Command Data: distil-labs/distil-SHELLper
Conclusion
The FunctionGemma 270M fine-tuning case sends a clear message to the AI industry: when a 270M model matches a 120B model on its target tasks, not every problem requires a massive model.
As demand for tool calling grows in constrained environments — edge AI, mobile deployment, IoT devices — the potential of ultra-small specialized models will only become more significant.
References
- Making FunctionGemma Work: Multi-Turn Tool Calling at 270M Parameters — Distil Labs Blog
- Reddit Discussion — r/LocalLLaMA
- FunctionGemma Model Card — HuggingFace