Mistral Voxtral TTS — 3-Second Voice Cloning, Open Weight, But No Japanese

On March 26, Mistral AI released Voxtral — a TTS model with 4B parameters, open weights, and voice cloning from just 3 seconds of audio. On specs alone, this is the most aggressive TTS announcement of the year.

I’d just spent the past two weeks comparing Chirp3 HD, Gemini TTS, and Edge TTS for a blog video automation pipeline. So when Voxtral dropped, my first reaction was “can I plug this into the pipeline?” The short answer: not yet. Here’s why.

Voxtral by the Numbers

Here’s what Voxtral brings to the table:

Parameters: 4B (Hugging Face: mistralai/Voxtral-4B-TTS-2603)
Latency: 70ms model latency, streaming supported
Voice cloning: Zero-shot/few-shot from a 3-second audio sample
Languages (9): English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, Arabic
License: CC BY-NC 4.0 (free for non-commercial use)
Commercial API: $0.016 / 1,000 characters

It reportedly beat ElevenLabs Flash v2.5 in human evaluations. This is Mistral’s own claim, so independent verification is needed, but publishing blind A/B test results at all signals confidence.

The 4B size is also noteworthy. Recent smartphone NPUs can handle 7-10B inference, making Voxtral a realistic candidate for on-device TTS. Not as tiny as Kitten TTS (14M), but the quality-to-size ratio is compelling.

TTS Engine Comparison: Engines I’ve Actually Used

Honestly, TTS comparison tables are meaningless from specs alone. The same “natural voice” can vary wildly in prosody, emotional expression, and per-language quality. Still, here’s how Voxtral stacks up against the engines I evaluated for my pipeline:

Feature	Voxtral	Gemini TTS (Charon)	ElevenLabs Flash v2.5	Chirp3 HD
Parameters	4B	Undisclosed	Undisclosed	Undisclosed
Open weight	Yes (CC BY-NC 4.0)	No	No	No
Voice cloning	3s zero-shot	No	Yes (instant)	No
Japanese	No	Yes	Yes	Yes
Korean	No	Yes	Yes	Yes
On-device	Yes (4B)	No	No	No
Latency	70ms	~200ms	~100ms	~300ms
Price (per 1k chars)	$0.016	from $0.001	$0.30	Free (limited)
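
To make the price row concrete, here is a rough cost sketch using the per-1,000-character figures from the table above. The function name `tts_cost_usd` and the 10,000-character script length are assumptions for illustration. Gemini TTS is listed as a starting price, so its number is a lower bound, and Chirp3 HD is left out because its free tier has no flat rate.

```python
# Per-1,000-character prices in USD, copied from the comparison table.
# Gemini's figure is a starting price, so treat it as a lower bound.
PRICE_PER_1K_CHARS = {
    "Voxtral": 0.016,
    "Gemini TTS": 0.001,
    "ElevenLabs Flash v2.5": 0.30,
}


def tts_cost_usd(engine: str, num_chars: int) -> float:
    """Estimate the synthesis cost for num_chars characters of input text."""
    return PRICE_PER_1K_CHARS[engine] * num_chars / 1000


if __name__ == "__main__":
    script_chars = 10_000  # hypothetical narration script length
    for engine in PRICE_PER_1K_CHARS:
        print(f"{engine}: ${tts_cost_usd(engine, script_chars):.3f}")
```

Per-character pricing scales linearly, so the ratio between engines stays the same at any script length.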

Two things jump out at me. First, Voxtral is the only one that’s both open-weight and on-device capable. That’s a value proposition no other engine can match. Second, Japanese and Korean are missing. More on that below.

What 3-Second Voice Cloning Means

Voxtral’s zero-shot voice cloning replicates a speaker’s timbre, intonation, and pace from a single 3-second audio sample. Few-shot mode improves accuracy further.

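# Voice cloning via the Voxtral API. Endpoint and field names are as given in this post;
# verify them against Mistral's API reference before use.
import base64
import os

import requests

# Base64-encode the 3-second reference clip (file name is a placeholder)
with open("sample_3sec.wav", "rb") as f:
    voice_sample = base64.b64encode(f.read()).decode("ascii")

response = requests.post(
    "https://api.mistral.ai/v1/audio/speech",
    # API key read from an environment variable (variable name is an assumption)
    headers={"Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}"},
    json={
        "model": "voxtral-4b-tts-2603",
        "input": "Hello, this is a voice cloning demonstration.",
        "voice_sample": voice_sample,  # 3-second sample
        "language": "en",
        "stream": True,
    },
    timeout=60,
)
response.raise_for_status()  # fail loudly on HTTP errors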

Where KaniTTS2 needed 3GB VRAM for voice cloning, Voxtral does it in one API call. Running locally means downloading weights from Hugging Face. At 4B parameters, the 16-bit weights alone take roughly 8 GB, so an 8GB card should suffice only with some quantization.
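
The VRAM figure can be sanity-checked with back-of-the-envelope arithmetic: weight memory is roughly parameter count times bytes per parameter. The sketch below assumes exactly 4 billion parameters and counts weights only. The helper name `weight_memory_gb` is made up for illustration.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory needed for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    params = 4e9  # assumption: "4B" taken as exactly 4 billion parameters
    for label, bits in [("fp32", 32), ("fp16/bf16", 16), ("int8", 8), ("int4", 4)]:
        print(f"{label}: {weight_memory_gb(params, bits):.1f} GB")
```

Activations and runtime buffers come on top of these figures, so real usage will be higher.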

Honestly, I can’t judge the actual quality of 3-second cloning without hearing it myself. Mistral’s demo samples were impressive, but demos always showcase the best cases.

The Critical Gap: Japanese and Korean

Here’s the main point. Japanese is not in the 9 supported languages. Neither is Korean.

For someone like me who creates content primarily targeting the Japanese market, this isn’t “disappointing” — it means “unusable.” My blog video pipeline needs audio in 4 languages (ko, ja, en, zh), and Voxtral covers only English. Chinese is also missing.
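
The gap is easy to express as a set difference. This is a minimal sketch: the two-letter codes are my own ISO 639-1 mapping of the language names listed earlier, and `coverage` is a hypothetical helper, not part of any Mistral SDK.

```python
# Codes for Voxtral's 9 supported languages (mapping from names is an assumption).
VOXTRAL_LANGS = {"en", "fr", "de", "es", "nl", "pt", "it", "hi", "ar"}

# Languages the blog video pipeline needs.
PIPELINE_LANGS = {"ko", "ja", "en", "zh"}


def coverage(required: set[str], supported: set[str]) -> tuple[list[str], list[str]]:
    """Return (covered, missing) language codes, each sorted alphabetically."""
    return sorted(required & supported), sorted(required - supported)


if __name__ == "__main__":
    covered, missing = coverage(PIPELINE_LANGS, VOXTRAL_LANGS)
    print("covered:", covered)  # ['en']
    print("missing:", missing)  # ['ja', 'ko', 'zh']
```

One covered language out of four is why the engine cannot slot into the pipeline as it stands.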

Mistral is a French company, and the language list (European languages plus Hindi and Arabic) makes the market priorities clear. East Asian language support is presumably on the roadmap, but right now, integration into my pipeline is a non-starter.

This is my biggest frustration with Voxtral. Not the model quality or architecture, but the language coverage that prevents me from even properly evaluating it.

The License Catch

The CC BY-NC 4.0 license also deserves attention. “Open weight” doesn’t mean you can freely commercialize it.

Non-commercial: Free to use, provided you give attribution
Commercial: Must use Mistral’s API ($0.016/1k chars)
Fine-tuning: You can download weights for research/personal fine-tuning, but commercial distribution of derivatives violates the license

This differs from Meta’s Llama (community license) or Stability AI’s models (open license). Mistral is using open weights as a marketing tool while monetizing through the API. Not a bad strategy, but expectations around the word “open” need calibrating.

Who Should Use It

Bottom line — Voxtral is attractive if these conditions apply:

English-centric projects: English TTS quality is the top priority and you need open weights
On-device TTS: The 4B size is practical for mobile/edge deployment
Voice cloning as a core feature: 3-second zero-shot is competitive
European multilingual needs: Good choice if you need French, German, Spanish, etc.

For projects requiring Japanese, Korean, or Chinese, Gemini TTS or ElevenLabs remain the practical options. I plan to continue using Edge TTS (free, multilingual) and Gemini TTS (best quality) in my video pipeline for now. When Voxtral adds East Asian languages, I’ll revisit the evaluation.

I’m rooting for open-weight TTS as a direction. The speech synthesis space is developing the same dynamic as LLMs, where open models chase commercial services — and Voxtral is the strongest signal in that trend. But there’s still distance between a signal and production deployment.

About the Author

Kim Jangwook
