Building Real-Time Voice Agents with Gemini 3.1 Flash Live — Hands-On Impressions
Analyzing Google's Gemini 3.1 Flash Live for building real-time voice and vision agents. Covers API structure, tool calling, 90+ language support, and honest limitations from a developer's perspective.
A new model name appeared in Google AI Studio: gemini-3.1-flash-live-preview. I almost scrolled past it ("another one"), but then I opened the spec sheet and stopped. An audio-in, audio-out model that supports function calling during real-time conversation? It was released on March 26; here's what I found after a day of hands-on testing.
What Changed
Compared to the previous 2.5 Flash Native Audio, there are three key differences.
First, latency dropped noticeably. Google didn't publish exact numbers, but they described it as having "fewer awkward pauses." In practice, the delay before a response starts is perceptibly shorter.
Second, the context window doubled to 131,072 input tokens. Feeding an entire meeting recording into a session and asking questions in real time is now realistic.
Third, tool calling works during live conversation. It scored 90.8% on ComplexFuncBench Audio. Say “What’s the weather in Seoul today?” and the model calls a function, then returns the result as speech — all within a single stream.
API Structure at a Glance
| Item | Spec |
|---|---|
| Model ID | gemini-3.1-flash-live-preview |
| Input | Text, images, audio, video |
| Output | Text, audio |
| Input tokens | 131,072 |
| Output tokens | 65,536 |
| Function Calling | Supported |
| Thinking | Supported (minimal/low/medium/high) |
| Search Grounding | Supported |
| Languages | 90+ |
The thinkingLevel parameter is interesting. Set it to minimal for lowest latency, or crank it to high for complex reasoning at the cost of slower responses. Having minimal as the default makes sense for voice agents.
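For illustration, here's roughly what that configuration might look like with the google-genai Python SDK. This is a sketch under assumptions: I'm guessing the config field is spelled thinking_config/thinking_level based on the thinkingLevel parameter above, and the preview SDK may spell it differently.

```python
from google import genai

client = genai.Client()  # reads GOOGLE_API_KEY from the environment

# Assumed shape: "thinking_config"/"thinking_level" mirror the documented
# thinkingLevel parameter; exact field names may differ in the preview SDK.
config = {
    "response_modalities": ["AUDIO"],
    "thinking_config": {"thinking_level": "minimal"},  # lowest latency for voice
}
```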
Hands-On Impressions
Enable the Live API in Google AI Studio, connect a microphone, and you can start talking right away. What impressed me most was this: you can prototype a voice agent in a browser with zero infrastructure setup. I tried speaking in Korean and the recognition quality was quite decent. Long sentences occasionally got clipped mid-way, though background noise filtering was noticeably improved from the previous version.
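If you'd rather skip the browser and go straight to code, a minimal session looks something like the sketch below. I'm assuming the Live API surface from earlier live models carries over to this preview; play() is a stand-in I wrote for audio playback, not part of the SDK.

```python
import asyncio
from google import genai

client = genai.Client()
MODEL = "gemini-3.1-flash-live-preview"

def play(chunk: bytes) -> None:
    # Placeholder: append chunks to a file; real playback needs an audio library.
    with open("reply.pcm", "ab") as f:
        f.write(chunk)

async def main():
    config = {"response_modalities": ["AUDIO"]}
    async with client.aio.live.connect(model=MODEL, config=config) as session:
        # Send one text turn; a real voice agent would stream mic audio instead.
        await session.send_client_content(
            turns={"role": "user", "parts": [{"text": "Say hello in Korean."}]}
        )
        async for message in session.receive():
            if message.data:       # raw PCM audio chunks from the model
                play(message.data)

asyncio.run(main())
```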
I wired up function calling with a simple weather API. Speak a request, the function fires, and the result comes back as speech. The key is that this entire flow happens within a single WebSocket session without interruption. However, asynchronous function calling isn’t supported yet — if an external API is slow, the user has to endure silence.
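Here's a sketch of that weather flow, continuing the previous snippet (it reuses client, MODEL, and play). Again, this assumes the existing Live API tool-calling surface applies to this model; get_weather() and its schema are placeholders for whatever you wire up.

```python
from google.genai import types

def get_weather(city: str) -> str:
    # Placeholder: call your real weather API here.
    return f"Sunny, 21°C in {city}"

# Placeholder tool declaration matching get_weather() above.
weather_tool = {
    "function_declarations": [{
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "OBJECT",
            "properties": {"city": {"type": "STRING"}},
            "required": ["city"],
        },
    }]
}

config = {"response_modalities": ["AUDIO"], "tools": [weather_tool]}

async def run():
    async with client.aio.live.connect(model=MODEL, config=config) as session:
        await session.send_client_content(
            turns={"role": "user", "parts": [{"text": "What's the weather in Seoul today?"}]}
        )
        async for message in session.receive():
            if message.tool_call:  # the model paused to request a function call
                results = [
                    types.FunctionResponse(
                        id=call.id,
                        name=call.name,
                        response={"result": get_weather(**call.args)},
                    )
                    for call in message.tool_call.function_calls
                ]
                # Results go back over the same WebSocket; the answer comes as speech.
                await session.send_tool_response(function_responses=results)
            elif message.data:
                play(message.data)
```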
Migration Notes from 2.5
If you’re upgrading from 2.5 Flash Native Audio, watch out for a few changes:
- Update model strings (gemini-3.1-flash-live-preview)
- New thinking configuration approach (thinkingLevel parameter)
- Server events may contain multiple content parts; update your parsing logic (see the sketch after this list)
- Async function calling is still not supported
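On the multi-part point, a defensive receive loop might look like this. It's a drop-in replacement for the loop in the earlier sketch, assuming the server_content/model_turn structure from earlier Live models; handle_text() and handle_audio() are placeholders for your own handlers.

```python
async for message in session.receive():
    content = message.server_content
    if content and content.model_turn:
        # Iterate every part instead of assuming exactly one per event.
        for part in content.model_turn.parts:
            if part.text:
                handle_text(part.text)               # placeholder handler
            elif part.inline_data:
                handle_audio(part.inline_data.data)  # placeholder handler
```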
Honest Limitations
I’ll be straightforward about what’s missing.
No structured output. Even for voice agents, backends often want JSON. Currently only text + audio output is available, so post-processing is required.
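My current workaround is a second, non-live pass over the transcript. This sketch assumes a standard text model with JSON output mode; the model name here is arbitrary and not part of the Live stack.

```python
from google import genai
from google.genai import types

client = genai.Client()

def transcript_to_json(transcript: str) -> str:
    # Post-process the voice transcript into JSON with a regular text call.
    response = client.models.generate_content(
        model="gemini-2.5-flash",  # any text model; this name is an assumption
        contents=f"Extract the user's request from this transcript as JSON:\n{transcript}",
        config=types.GenerateContentConfig(response_mime_type="application/json"),
    )
    return response.text
```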
No caching. Real-time conversations tend to repeat similar queries, and without caching, costs pile up. You’d need a custom cache layer in front for production.
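What such a layer could look like, as a naive sketch I put together: key on the normalized transcript and replay stored audio on a hit. A production version would want fuzzy matching (embeddings, say) and TTL eviction.

```python
import hashlib

class AudioResponseCache:
    """Naive exact-match cache; real systems would want fuzzy matching."""

    def __init__(self):
        self._store: dict[str, bytes] = {}

    @staticmethod
    def _key(query: str) -> str:
        # Normalize aggressively so trivial rephrasings still collide.
        return hashlib.sha256(query.strip().lower().encode()).hexdigest()

    def get(self, query: str) -> bytes | None:
        return self._store.get(self._key(query))

    def put(self, query: str, audio: bytes) -> None:
        self._store[self._key(query)] = audio
```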
Preview instability. Rate limits and pricing remain unclear. It’s too early for production deployment.
Competitive landscape. OpenAI’s Realtime API already offers similar capabilities, and there are rumors that Anthropic is preparing voice interfaces too. There aren’t enough comparative benchmarks to claim Flash Live is definitively the best.
Where Can You Use This
Realistic scenarios for immediate application:
- Internal voice FAQ bot: Sales team asks product specs by voice, gets instant answers. Wire up internal DB through function calling.
- Multilingual customer support: With 90+ languages, it’s useful for first-line global support. Quality will vary by language, though.
- Real-time meeting assistant: Leverage the large context window for live summarization and action item extraction during meetings.
On the other hand, high-trust scenarios like medical consultation or financial transactions are still off-limits. The hallucination risk of a preview model isn’t acceptable in those domains.
Wrap-Up
Gemini 3.1 Flash Live significantly lowers the barrier to building voice agents. The integration of tool calling into real-time voice streams is the decisive difference from the previous generation. But it's still in preview: pricing, stability, and constraints remain unclear, and production essentials like structured output and caching are missing.
I think this model is currently the most accessible option for voice AI agent prototyping. Production readiness should be reassessed after GA, but right now, it’s absolutely worth firing up a microphone in Google AI Studio to explore the possibilities.