# 🚀 Model Card for Kyutai TTS

This is a model for streaming text-to-speech (TTS), enabling audio output as soon as the first few words of the input text are provided.
See also the project page, the Colab example, and the GitHub repository. A preprint research paper is coming soon!
## 🚀 Quick Start

To get started with the model, see the GitHub repository.
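If you just want the published artifacts locally before diving into the repository, a minimal sketch with `huggingface_hub` is shown below. The checkpoint repository id `kyutai/tts-1.6b-en_fr` is an assumption (this card only names the `kyutai/tts-voices` repository elsewhere), so substitute the id of the checkpoint you actually intend to use; the inference entry points themselves live in the GitHub repository.

```python
# Minimal sketch: download the model artifacts locally with huggingface_hub.
# The checkpoint repo id below is an assumption; replace it with the checkpoint
# you want, then run the inference scripts from the GitHub repository on it.
from huggingface_hub import snapshot_download

model_dir = snapshot_download("kyutai/tts-1.6b-en_fr")  # assumed checkpoint repo id
print("model weights downloaded to:", model_dir)
```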
## ✨ Features

This is a streaming text-to-speech (TTS) model. Unlike offline TTS models, which need the entire text before producing any audio, this model starts outputting audio as soon as the first few words of the text are provided.
## 📚 Documentation

### Model Details

The model architecture is a hierarchical Transformer that consumes tokenized text and generates audio tokenized by Mimi (see the Moshi paper).
The frame rate is 12.5 Hz and each audio frame is represented by 32 audio tokens. You can use fewer audio tokens at inference time for faster generation.
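To make the token budget concrete, the short calculation below uses only the numbers quoted above (12.5 Hz, 32 tokens per frame) to show how many audio tokens correspond to a second of speech, and how keeping fewer codebooks at inference reduces that count:

```python
# Back-of-the-envelope token budget, using only the figures quoted above.
FRAME_RATE_HZ = 12.5    # Mimi frames per second of audio
TOKENS_PER_FRAME = 32   # audio tokens (codebooks) per frame

def audio_tokens(duration_s: float, codebooks: int = TOKENS_PER_FRAME) -> int:
    """Number of audio tokens needed for `duration_s` seconds of speech."""
    return int(duration_s * FRAME_RATE_HZ * codebooks)

print(audio_tokens(1.0))      # 400 tokens for one second with all 32 codebooks
print(audio_tokens(1.0, 16))  # 200 tokens per second if only 16 codebooks are kept
```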
The backbone model has 1B parameters, and the depth transformer has 600M parameters and uses partial weight sharing similar to Hibiki.
The audio is shifted by 16 steps (1.28 sec.) with respect to the text, and the model uses an acoustic/semantic delay of 2.
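To visualize how these offsets line up, the toy schedule below (not the actual training code, and the exact orientation of the offsets is our reading of the numbers above) prints which text position, semantic frame, and acoustic frame are active at each generation step:

```python
# Toy schedule illustrating the delays described above (not the training code).
TEXT_TO_AUDIO_SHIFT = 16  # audio lags the text by 16 steps (1.28 s at 12.5 Hz)
ACOUSTIC_DELAY = 2        # acoustic codebooks lag the semantic codebook by 2 steps

for step in range(20):
    text_pos = step                                   # text token consumed at this step
    semantic_frame = step - TEXT_TO_AUDIO_SHIFT       # semantic audio token produced
    acoustic_frame = semantic_frame - ACOUSTIC_DELAY  # acoustic audio tokens produced
    print(
        f"step {step:2d}: text[{text_pos}]",
        f"semantic[{semantic_frame}]" if semantic_frame >= 0 else "semantic[-]",
        f"acoustic[{acoustic_frame}]" if acoustic_frame >= 0 else "acoustic[-]",
    )
```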
### Model Description

Kyutai TTS is a decoder-only model for streaming text-to-speech. It leverages the multistream architecture of Moshi to model the speech stream conditioned on the text stream. The audio stream is shifted with respect to the text stream, which lets the model predict audio tokens from the text it has already received.
| Property | Details |
|---|---|
| Developed by | Kyutai |
| Model type | Streaming text-to-speech |
| Language(s) (NLP) | English and French |
| License | Model weights are licensed under CC-BY 4.0 |
| Repository | GitHub |
### Uses

#### Direct Use

This model can perform streaming text-to-speech generation, including dialogues. It supports voice conditioning through pre-computed cross-attention embeddings, which are provided for a number of voices in our [tts-voices](https://huggingface.co/kyutai/tts-voices) repository.
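As a small sketch of how to browse those embeddings, the snippet below lists the files published in the `kyutai/tts-voices` repository and downloads one of them; only the repository name comes from this card, and the file layout is whatever the listing returns:

```python
# List and fetch the pre-computed voice embeddings from the tts-voices repository.
from huggingface_hub import hf_hub_download, list_repo_files

voice_files = list_repo_files("kyutai/tts-voices")
print(f"{len(voice_files)} files available")
for name in voice_files[:10]:
    print(name)

# Download one of them locally (the exact path depends on the repository layout).
local_path = hf_hub_download("kyutai/tts-voices", voice_files[0])
print("downloaded to", local_path)
```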
This model does not support classifier-free guidance (CFG) directly; instead, it was trained with CFG distillation, which keeps the quality benefit of CFG without doubling the batch size at inference. It is easy to batch and can reach a throughput of about 75 seconds of generated audio per second of compute.
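For readers unfamiliar with the trade-off, the toy snippet below shows the logit combination that classic CFG computes at every decoding step, which requires a conditional and an unconditional forward pass and therefore a doubled batch; CFG distillation trains the model so that a single conditional pass already approximates this guided output. The values and the guidance scale are purely illustrative.

```python
import numpy as np

# Illustrative only: what classic CFG computes at each step, and what a
# CFG-distilled model is trained to produce in a single forward pass.
guidance_scale = 3.0                        # illustrative value, not the model's
logits_cond = np.array([2.0, 0.5, -1.0])    # logits from the text-conditioned pass
logits_uncond = np.array([1.0, 0.7, -0.2])  # logits from the unconditioned pass

# Classic CFG: run both passes (doubling the batch) and extrapolate.
logits_cfg = logits_uncond + guidance_scale * (logits_cond - logits_uncond)
print(logits_cfg)
```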
**⚠️ Important note**

This model does not perform watermarking, for two reasons:

- Watermarking can easily be deactivated for open-source models.
- Early experiments show that the watermarking systems used by existing TTS models are removed by simply encoding and decoding the audio with Mimi.

Instead, voice cloning is restricted to the use of pre-computed voice embeddings.
## 🔧 Technical Details

### Training Details

The model was trained for 750k steps with a batch size of 64 and a segment duration of 120 seconds. CFG distillation was then performed for 24k updates.
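To give a sense of scale, the snippet below simply multiplies out the figures quoted above (steps × batch size × segment duration); it is plain arithmetic on the stated numbers, not additional information about the training run:

```python
# Rough scale of pretraining, computed from the figures quoted above.
steps = 750_000
batch_size = 64
segment_seconds = 120

total_hours = steps * batch_size * segment_seconds / 3600
print(f"~{total_hours:,.0f} hours of audio segments seen during pretraining")
# For comparison, the pretraining collection described below is 2.5 million hours.
```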
#### Training Data

Pretraining stage: we use an audio collection of 2.5 million hours of publicly available audio content. For this dataset, we obtained synthetic transcripts by running [whisper-timestamped](https://github.com/linto-ai/whisper-timestamped) with whisper-medium.
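As a hedged sketch of how such transcripts can be produced, the snippet below runs whisper-timestamped with the medium checkpoint on a placeholder file; the package and model size come from the paragraph above, while the file path and the exact shape of the returned word entries are assumptions based on the package's documented, openai-whisper-like interface:

```python
# Sketch: generating a word-timestamped transcript with whisper-timestamped.
# "audio.wav" is a placeholder path.
import whisper_timestamped as whisper

audio = whisper.load_audio("audio.wav")
model = whisper.load_model("medium")     # the card mentions whisper-medium
result = whisper.transcribe(model, audio)

for segment in result["segments"]:
    for word in segment.get("words", []):
        print(f'{word["start"]:6.2f} {word["end"]:6.2f} {word["text"]}')
```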
#### Compute Infrastructure

Pretraining was done on 32 Nvidia H100 GPUs. CFG distillation was done on 8 such GPUs.
## 📄 License

The model weights are licensed under CC-BY 4.0.

### Model Card Authors

Neil Zeghidour, Eugene Kharitonov, Manu Orsini, Václav Volhejn, Gabriel de Marmiesse, Edouard Grave, Patrick Perez, Laurent Mazaré, Alexandre Défossez