Building Real-Time Voice Agents with Gemini 3.1 Flash Live — Hands-On Impressions

Analyzing Google's Gemini 3.1 Flash Live for building real-time voice and vision agents. Covers API structure, tool calling, 90+ language support, and honest limitations from a developer's perspective.

A new model name appeared in Google AI Studio: gemini-3.1-flash-live-preview. I almost scrolled past it — “another one” — but then I opened the spec sheet and stopped. An audio-in, audio-out model that supports function calling during real-time conversation? It was released on March 26; here’s what I found after a day of hands-on testing.

What Changed

Compared to the previous 2.5 Flash Native Audio, there are three key differences.

First, latency dropped. Google didn’t publish exact numbers, describing it only as having “fewer awkward pauses,” but in practice the delay before a response starts is noticeably shorter.

Second, the conversation context doubled, to 131,072 input tokens. Feeding in an entire meeting recording and asking questions about it in real time is now realistic.

Third, tool calling works during live conversation. It scored 90.8% on ComplexFuncBench Audio. Say “What’s the weather in Seoul today?” and the model calls a function, then returns the result as speech — all within a single stream.

API Structure at a Glance

| Item | Spec |
| --- | --- |
| Model ID | gemini-3.1-flash-live-preview |
| Input | Text, images, audio, video |
| Output | Text, audio |
| Input tokens | 131,072 |
| Output tokens | 65,536 |
| Function calling | Supported |
| Thinking | Supported (minimal/low/medium/high) |
| Search grounding | Supported |
| Languages | 90+ |

The thinkingLevel parameter is interesting. Set it to minimal for lowest latency, or crank it to high for complex reasoning at the cost of slower responses. Having minimal as the default makes sense for voice agents.
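As a sketch, a session config with thinking dialed down for latency might be built like this. The key names (responseModalities, thinkingConfig, thinkingLevel) follow the Gemini API’s camelCase JSON conventions, but treat the exact shape as an assumption for a preview model:

```python
# Sketch: building a Live session config with thinkingLevel.
# Field names are assumptions based on Gemini API conventions;
# verify against the official docs before relying on them.

def make_live_config(thinking_level: str = "minimal") -> dict:
    """Return a session config dict; 'minimal' favors low latency."""
    allowed = {"minimal", "low", "medium", "high"}
    if thinking_level not in allowed:
        raise ValueError(f"thinkingLevel must be one of {sorted(allowed)}")
    return {
        "responseModalities": ["AUDIO"],
        "thinkingConfig": {"thinkingLevel": thinking_level},
    }

config = make_live_config("minimal")
```

For a voice agent I’d start at minimal and only raise the level if answers are visibly shallow; the latency cost compounds across every conversational turn.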

Hands-On Impressions

Enable the Live API in Google AI Studio, connect a microphone, and you can start talking right away. What impressed me most was this: you can prototype a voice agent in a browser with zero infrastructure setup. I tried speaking in Korean and the recognition quality was quite decent. Long sentences occasionally got clipped mid-way, though background noise filtering was noticeably improved from the previous version.

I wired up function calling with a simple weather API. Speak a request, the function fires, and the result comes back as speech. The key is that this entire flow happens within a single WebSocket session without interruption. However, asynchronous function calling isn’t supported yet — if an external API is slow, the user has to endure silence.
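The loop itself is simple to sketch: the model names a function, the client runs it and sends the result back into the same stream. The message shape and the get_weather handler below are illustrative assumptions, not the actual wire format — and the synchronous dispatch is exactly why a slow external API stalls the conversation:

```python
# Sketch of the synchronous tool-call loop. The message shape and
# handlers are illustrative assumptions, not the real wire format.

def get_weather(city: str) -> dict:
    # Stand-in for a real weather API call.
    return {"city": city, "condition": "clear", "temp_c": 18}

TOOL_HANDLERS = {"get_weather": get_weather}

def handle_tool_call(message: dict) -> dict:
    """Run the requested function and wrap the result as a tool response."""
    name = message["name"]
    args = message.get("args", {})
    handler = TOOL_HANDLERS[name]  # KeyError -> model asked for an unknown tool
    result = handler(**args)       # blocks here: no async tool calling yet
    return {"functionResponse": {"name": name, "response": result}}

reply = handle_tool_call({"name": "get_weather", "args": {"city": "Seoul"}})
```

Until async tool calling lands, keeping every handler fast (or backed by a local cache) is the only way to avoid dead air.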

Migration Notes from 2.5

If you’re upgrading from 2.5 Flash Native Audio, watch out for a few changes:

  • Update model strings (gemini-3.1-flash-live-preview)
  • New thinking configuration approach (thinkingLevel parameter)
  • Server events may contain multiple content parts — update your parsing logic
  • Async function calling still not supported
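Of these, the multi-part change is the one most likely to break existing code: a parser that read parts[0] and stopped will silently drop content. A minimal fix, sketched against an assumed and simplified event shape:

```python
# Sketch: server events may now carry multiple content parts.
# The event structure below is an assumed, simplified shape.

def extract_text(event: dict) -> str:
    """Concatenate text from every part instead of reading parts[0] only."""
    parts = event.get("serverContent", {}).get("modelTurn", {}).get("parts", [])
    return "".join(p["text"] for p in parts if "text" in p)

event = {
    "serverContent": {
        "modelTurn": {
            "parts": [
                {"text": "The weather in Seoul "},
                {"inlineData": {"mimeType": "audio/pcm"}},  # non-text part
                {"text": "is clear today."},
            ]
        }
    }
}
text = extract_text(event)
```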

Honest Limitations

I’ll be straightforward about what’s missing.

No structured output. Even for voice agents, backends often want JSON. Currently only text + audio output is available, so post-processing is required.
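The practical workaround, until structured output lands, is to prompt the model to include JSON in its text channel and pull it out yourself. A defensive extraction sketch (the example payload is made up):

```python
import json
import re

# Sketch: post-process model text into JSON, since the Live model
# currently has no structured-output mode.

def extract_json(text: str):
    """Return a JSON object embedded in model text, or None."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if not match:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

payload = extract_json('Sure! {"intent": "weather", "city": "Seoul"} Anything else?')
```

It’s brittle — the model can emit malformed or conversational JSON-ish text — so always handle the None path.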

No caching. Real-time conversations tend to repeat similar queries, and without caching, costs pile up. You’d need a custom cache layer in front for production.
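That cache layer doesn’t need to be elaborate to start: normalize the transcribed query and memoize responses for a short TTL. A minimal in-memory sketch — production would likely want Redis and semantic rather than exact-match keys:

```python
import time

# Sketch: a tiny response cache keyed on normalized query text.
# Real deployments would want semantic matching, not exact strings.

class QueryCache:
    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, str]] = {}

    @staticmethod
    def _key(query: str) -> str:
        return " ".join(query.lower().split())  # collapse case and whitespace

    def get(self, query: str):
        entry = self._store.get(self._key(query))
        if entry is None:
            return None
        stored_at, response = entry
        if time.monotonic() - stored_at > self.ttl:
            return None  # expired
        return response

    def put(self, query: str, response: str) -> None:
        self._store[self._key(query)] = (time.monotonic(), response)

cache = QueryCache()
cache.put("What's the weather in Seoul?", "Clear, 18°C")
hit = cache.get("what's  the weather in seoul?")
```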

Preview instability. Rate limits and pricing remain unclear. It’s too early for production deployment.

Competitive landscape. OpenAI’s Realtime API already offers similar capabilities, and there are rumors that Anthropic is preparing voice interfaces too. There aren’t enough comparative benchmarks to claim Flash Live is definitively the best.

Where Can You Use This

Realistic scenarios for immediate application:

  • Internal voice FAQ bot: Sales team asks product specs by voice, gets instant answers. Wire up internal DB through function calling.
  • Multilingual customer support: With 90+ languages, it’s useful for first-line global support. Quality will vary by language, though.
  • Real-time meeting assistant: Leverage the large context window for live summarization and action item extraction during meetings.

On the other hand, high-trust scenarios like medical consultation or financial transactions are still off-limits. The hallucination risk of a preview model isn’t acceptable in those domains.

Wrap-Up

Gemini 3.1 Flash Live significantly lowers the bar for building voice agents. The integration of tool calling into real-time voice streams is the decisive difference from the previous generation. But it’s still in preview — pricing, stability, and constraints remain unclear, and production essentials like structured output and caching are missing.

I think this model is currently the most accessible option for voice AI agent prototyping. Production readiness should be reassessed after GA, but right now, it’s absolutely worth firing up a microphone in Google AI Studio to explore the possibilities.

About the Author

Kim Jangwook

Full-Stack Developer specializing in AI/LLM

Building AI agent systems, LLM applications, and automation solutions with 10+ years of web development experience. Sharing practical insights on Claude Code, MCP, and RAG systems.