# 🚀 Model Card for Kyutai TTS

This is a model for streaming text-to-speech (TTS), enabling audio output as soon as the first few words of the input text are provided.
See also the project page, the Colab example, and the GitHub repository. A preprint research paper is coming soon!
## 🚀 Quick Start

To get started with the model, see the GitHub repository.
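If you just want the published artifacts locally before diving into the repository, a minimal sketch with `huggingface_hub` is shown below. The checkpoint repository id `kyutai/tts-1.6b-en_fr` is an assumption (this card only names the `kyutai/tts-voices` repository elsewhere), so substitute the id of the checkpoint you actually intend to use; the inference entry points themselves live in the GitHub repository.

```python
# Minimal sketch: download the model artifacts locally with huggingface_hub.
# The checkpoint repo id below is an assumption; replace it with the checkpoint
# you want, then run the inference scripts from the GitHub repository on it.
from huggingface_hub import snapshot_download

model_dir = snapshot_download("kyutai/tts-1.6b-en_fr")  # assumed checkpoint repo id
print("model weights downloaded to:", model_dir)
```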
## ✨ Features

This is a streaming text-to-speech (TTS) model. Unlike offline TTS models, which need the entire text before producing any audio, this model starts outputting audio as soon as the first few words of the text are provided.
## 📚 Documentation

### Model Details

The model architecture is a hierarchical Transformer that consumes tokenized text and generates audio tokenized by Mimi (see the Moshi paper).
The frame rate is 12.5 Hz and each audio frame is represented by 32 audio tokens. You can use fewer audio tokens at inference time for faster generation.
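To make the token budget concrete, the short calculation below uses only the numbers quoted above (12.5 Hz, 32 tokens per frame) to show how many audio tokens correspond to a second of speech, and how keeping fewer codebooks at inference reduces that count:

```python
# Back-of-the-envelope token budget, using only the figures quoted above.
FRAME_RATE_HZ = 12.5    # Mimi frames per second of audio
TOKENS_PER_FRAME = 32   # audio tokens (codebooks) per frame

def audio_tokens(duration_s: float, codebooks: int = TOKENS_PER_FRAME) -> int:
    """Number of audio tokens needed for `duration_s` seconds of speech."""
    return int(duration_s * FRAME_RATE_HZ * codebooks)

print(audio_tokens(1.0))      # 400 tokens for one second with all 32 codebooks
print(audio_tokens(1.0, 16))  # 200 tokens per second if only 16 codebooks are kept
```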
The backbone model has 1B parameters, and the depth transformer has 600M parameters and uses partial weight sharing similar to Hibiki.
The audio is shifted by 16 steps (1.28 sec.) with respect to the text, and the model uses an acoustic/semantic delay of 2.
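To visualize how these offsets line up, the toy schedule below (not the actual training code, and the exact orientation of the offsets is our reading of the numbers above) prints which text position, semantic frame, and acoustic frame are active at each generation step:

```python
# Toy schedule illustrating the delays described above (not the training code).
TEXT_TO_AUDIO_SHIFT = 16  # audio lags the text by 16 steps (1.28 s at 12.5 Hz)
ACOUSTIC_DELAY = 2        # acoustic codebooks lag the semantic codebook by 2 steps

for step in range(20):
    text_pos = step                                   # text token consumed at this step
    semantic_frame = step - TEXT_TO_AUDIO_SHIFT       # semantic audio token produced
    acoustic_frame = semantic_frame - ACOUSTIC_DELAY  # acoustic audio tokens produced
    print(
        f"step {step:2d}: text[{text_pos}]",
        f"semantic[{semantic_frame}]" if semantic_frame >= 0 else "semantic[-]",
        f"acoustic[{acoustic_frame}]" if acoustic_frame >= 0 else "acoustic[-]",
    )
```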
### Model Description

Kyutai TTS is a decoder-only model for streaming text-to-speech. It leverages the multistream architecture of Moshi to model the speech stream conditioned on the text stream. The audio stream is shifted with respect to the text stream, which lets the model predict audio tokens from the text it has already received.
| Property | Details |
|---|---|
| Developed by | Kyutai |
| Model type | Streaming text-to-speech |
| Language(s) (NLP) | English and French |
| License | Model weights are licensed under CC-BY 4.0 |
| Repository | GitHub |
### Uses

#### Direct Use

This model can perform streaming text-to-speech generation, including dialogues. It supports voice conditioning through pre-computed cross-attention embeddings, which are provided for a number of voices in our [tts-voices](https://huggingface.co/kyutai/tts-voices) repository.
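As a small sketch of how to browse those embeddings, the snippet below lists the files published in the `kyutai/tts-voices` repository and downloads one of them; only the repository name comes from this card, and the file layout is whatever the listing returns:

```python
# List and fetch the pre-computed voice embeddings from the tts-voices repository.
from huggingface_hub import hf_hub_download, list_repo_files

voice_files = list_repo_files("kyutai/tts-voices")
print(f"{len(voice_files)} files available")
for name in voice_files[:10]:
    print(name)

# Download one of them locally (the exact path depends on the repository layout).
local_path = hf_hub_download("kyutai/tts-voices", voice_files[0])
print("downloaded to", local_path)
```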
This model does not support classifier-free guidance (CFG) directly; instead, it was trained with CFG distillation, which keeps the quality benefit of CFG without doubling the batch size at inference. It is easy to batch and can reach a throughput of about 75 seconds of generated audio per second of compute.
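For readers unfamiliar with the trade-off, the toy snippet below shows the logit combination that classic CFG computes at every decoding step, which requires a conditional and an unconditional forward pass and therefore a doubled batch; CFG distillation trains the model so that a single conditional pass already approximates this guided output. The values and the guidance scale are purely illustrative.

```python
import numpy as np

# Illustrative only: what classic CFG computes at each step, and what a
# CFG-distilled model is trained to produce in a single forward pass.
guidance_scale = 3.0                        # illustrative value, not the model's
logits_cond = np.array([2.0, 0.5, -1.0])    # logits from the text-conditioned pass
logits_uncond = np.array([1.0, 0.7, -0.2])  # logits from the unconditioned pass

# Classic CFG: run both passes (doubling the batch) and extrapolate.
logits_cfg = logits_uncond + guidance_scale * (logits_cond - logits_uncond)
print(logits_cfg)
```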
**⚠️ Important note**

This model does not perform watermarking, for two reasons:

- Watermarking can easily be deactivated for open-source models.
- Early experiments show that the watermarking systems used by existing TTS models are removed by simply encoding and decoding the audio with Mimi.

Instead, voice cloning is restricted to the use of pre-computed voice embeddings.
## 🔧 Technical Details

### Training Details

The model was trained for 750k steps with a batch size of 64 and a segment duration of 120 seconds. CFG distillation was then performed for 24k updates.
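To give a sense of scale, the snippet below simply multiplies out the figures quoted above (steps × batch size × segment duration); it is plain arithmetic on the stated numbers, not additional information about the training run:

```python
# Rough scale of pretraining, computed from the figures quoted above.
steps = 750_000
batch_size = 64
segment_seconds = 120

total_hours = steps * batch_size * segment_seconds / 3600
print(f"~{total_hours:,.0f} hours of audio segments seen during pretraining")
# For comparison, the pretraining collection described below is 2.5 million hours.
```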
#### Training Data

Pretraining stage: we use an audio collection of 2.5 million hours of publicly available audio content. For this dataset, we obtained synthetic transcripts by running [whisper-timestamped](https://github.com/linto-ai/whisper-timestamped) with whisper-medium.
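As a hedged sketch of how such transcripts can be produced, the snippet below runs whisper-timestamped with the medium checkpoint on a placeholder file; the package and model size come from the paragraph above, while the file path and the exact shape of the returned word entries are assumptions based on the package's documented, openai-whisper-like interface:

```python
# Sketch: generating a word-timestamped transcript with whisper-timestamped.
# "audio.wav" is a placeholder path.
import whisper_timestamped as whisper

audio = whisper.load_audio("audio.wav")
model = whisper.load_model("medium")     # the card mentions whisper-medium
result = whisper.transcribe(model, audio)

for segment in result["segments"]:
    for word in segment.get("words", []):
        print(f'{word["start"]:6.2f} {word["end"]:6.2f} {word["text"]}')
```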
#### Compute Infrastructure

Pretraining was done on 32 Nvidia H100 GPUs. CFG distillation was done on 8 such GPUs.
## 📄 License

The model weights are licensed under CC-BY 4.0.

### Model Card Authors

Neil Zeghidour, Eugene Kharitonov, Manu Orsini, Václav Volhejn, Gabriel de Marmiesse, Edouard Grave, Patrick Perez, Laurent Mazaré, Alexandre Défossez