KaniTTS2 — Open-Source 400M TTS Model with Voice Cloning on Just 3GB VRAM

KaniTTS2, a lightweight 400M-parameter TTS model, has been open-sourced with voice cloning capabilities requiring only 3GB VRAM. With full pretraining code released, it marks a milestone in voice AI democratization.

Overview

Text-to-speech (TTS) technology has traditionally required large-scale models and high-end GPUs. However, the release of KaniTTS2 significantly lowers these barriers. With just 400 million parameters and 3GB of VRAM, this model enables real-time voice cloning, making it one of the latest examples of voice AI democratization.

Released by the nineninesix-ai team under the Apache 2.0 license, this project goes beyond model weights — it includes the complete pretraining code, allowing anyone to train their own TTS model from scratch.

KaniTTS2 Key Specs

| Specification | Details |
| --- | --- |
| Parameters | 400M (BF16) |
| Sample Rate | 22kHz |
| GPU VRAM | 3GB |
| RTF (Real-Time Factor) | ~0.2 (on RTX 5090) |
| Training Data | ~10,000 hours of speech |
| Training Time | 6 hours on 8x H100s |
| Languages | English, Spanish (more coming) |
| License | Apache 2.0 |

An RTF of 0.2 means generating 1 second of speech takes only 0.2 seconds — more than fast enough for real-time conversational use cases.
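As a quick sanity check on what RTF means, here is a small Python sketch (the numbers are illustrative, not measured benchmarks):

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """RTF = wall-clock time spent generating / duration of the audio produced.
    RTF < 1.0 means the model synthesizes faster than real time."""
    return generation_seconds / audio_seconds

# Generating 10 s of speech in 2 s of wall-clock time:
rtf = real_time_factor(2.0, 10.0)
print(rtf)        # 0.2 -> each second of audio takes 0.2 s to produce
print(rtf < 1.0)  # True -> fast enough for live, conversational use
```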

Why KaniTTS2 Matters

1. Extreme Lightweight Design

Previous high-quality TTS models often required billions of parameters and 10GB+ of VRAM. KaniTTS2 achieves competitive quality with 400M parameters, running on consumer-grade GPUs like the RTX 3060.
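A rough back-of-the-envelope check shows why 3GB is plausible (a sketch only; real usage also depends on activations, the KV cache, and the runtime):

```python
params = 400_000_000   # 400M parameters
bytes_per_param = 2    # BF16 stores each parameter in 2 bytes

weights_gb = params * bytes_per_param / 1024**3
print(f"{weights_gb:.2f} GB")  # ~0.75 GB for the weights alone,
                               # leaving headroom inside a 3GB budget
```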

2. Complete Open-Source Pretraining Framework

Beyond model weights, the entire pretraining codebase is publicly available. This opens up possibilities for:

  • Training TTS models for underrepresented languages
  • Building domain-specific voice models (medical, legal, etc.)
  • Customizing for specific accents and dialects

3. Built-in Voice Cloning

Voice cloning is built into the model without requiring separate fine-tuning. Simply provide a reference audio sample, and the model generates speech in that speaker’s voice.
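Conceptually, the call pattern looks like the sketch below. Everything here (`clone_speech`, the argument names) is a hypothetical placeholder to illustrate the zero-shot workflow, not the actual KaniTTS2 API; check the official repo for the real interface:

```python
# Hypothetical sketch of a zero-shot voice-cloning call -- NOT the real API.
def clone_speech(text: str, reference_wav: str) -> dict:
    """Stand-in for the model call: a real invocation would return 22kHz
    audio conditioned on the speaker heard in `reference_wav`."""
    return {"text": text, "speaker_ref": reference_wav, "sample_rate": 22050}

out = clone_speech("Hello there!", "speaker_sample.wav")
print(out["sample_rate"])  # 22050
```

The key point the sketch captures: the reference audio is just another input at inference time, so no per-speaker fine-tuning step is needed.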

Architecture and Training

graph LR
    A[Text Input] --> B[Text Encoder]
    B --> C[KaniTTS2 Core<br/>400M params]
    D[Reference Voice] --> C
    C --> E[Speech Decoder]
    E --> F[22kHz Audio Output]

Training uses approximately 10,000 hours of speech data and completes in just 6 hours on 8 H100 GPUs. This is remarkably efficient compared to large-scale TTS models that can take days or weeks to train.
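To put that efficiency in numbers (simple arithmetic from the figures above):

```python
gpus = 8
wall_clock_hours = 6
data_hours = 10_000

gpu_hours = gpus * wall_clock_hours
print(gpu_hours)               # 48 GPU-hours total
print(data_hours / gpu_hours)  # ~208 hours of speech absorbed per GPU-hour
```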

Getting Started

Download from HuggingFace

KaniTTS2 offers two model variants:

  • Multilingual Model (Pretrained): English and Spanish support
  • English-Only Model: Optimized for English with local accent support
# Download from HuggingFace
# Multilingual model
git clone https://huggingface.co/nineninesix/kani-tts-2-pt

# English-only model
git clone https://huggingface.co/nineninesix/kani-tts-2-en
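If you prefer fetching the weights from Python rather than `git clone`, the `huggingface_hub` library's `snapshot_download` can do it; the repo IDs below are taken from the clone URLs above:

```python
# Repo IDs from the official clone URLs above.
REPOS = {
    "multilingual": "nineninesix/kani-tts-2-pt",
    "english": "nineninesix/kani-tts-2-en",
}

def fetch(variant: str) -> str:
    """Download the chosen variant and return the local directory path.
    Requires `pip install huggingface_hub`."""
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id=REPOS[variant])

# Example: local_dir = fetch("english")
```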

Try the Demo on HuggingFace Spaces

Experience the model directly in your browser without installation:

Train Your Own Model

With the pretraining code, you can build a TTS model from scratch:

# Clone the pretraining code
git clone https://github.com/nineninesix-ai/kani-tts-2-pretrain
cd kani-tts-2-pretrain

# Follow the README for setup and training instructions

Lightweight TTS Model Comparison

Several locally-runnable TTS models have emerged recently:

| Model | Parameters | VRAM | Voice Cloning | Open-Source Training Code |
| --- | --- | --- | --- | --- |
| KaniTTS2 | 400M | 3GB | Yes | Yes |
| Bark | ~1B | 6GB+ | No | No |
| XTTS v2 | ~500M | 4GB+ | Yes | Partial |
| Piper | ~60M | <1GB | No | Yes |

KaniTTS2 stands out by providing both voice cloning and complete pretraining code while maintaining a lightweight footprint.

What Voice AI Democratization Means

The release of KaniTTS2 goes beyond a simple model drop — it’s a significant milestone for voice AI democratization:

  1. Underrepresented Language Support: Open pretraining code enables communities to build TTS models for their own languages
  2. Cost Barrier Removal: 3GB VRAM is enough, eliminating the need for expensive GPUs
  3. Research Acceleration: Full training pipeline disclosure improves reproducibility and speeds up TTS research
  4. Personal Privacy: Running locally instead of through cloud APIs ensures voice data privacy

Conclusion

KaniTTS2 exemplifies the “small but mighty” model philosophy. With voice cloning capabilities packed into just 400M parameters, it challenges the notion that only large models can deliver high-quality speech synthesis.

The complete release of pretraining code is expected to positively impact the entire voice AI ecosystem — from underrepresented language support to domain-specific optimization and personalized voice assistant development.

As local AI continues to grow more powerful, KaniTTS2 proves that the “local-first” approach is becoming a reality in voice AI as well.

About the Author

Kim Jangwook

Full-Stack Developer specializing in AI/LLM

Building AI agent systems, LLM applications, and automation solutions with 10+ years of web development experience. Sharing practical insights on Claude Code, MCP, and RAG systems.