csm-1b: an open-source speech generation model that produces RVQ audio codes from text and audio inputs

CSM 1B

Developed by Sesame

CSM is a 1-billion-parameter speech generation model developed by Sesame. It generates RVQ audio codes from text and audio inputs.

Speech Synthesis

Safetensors

English

Open Source License: Apache-2.0

#Multi-turn Dialogue Voice Generation #High-fidelity Timbre Control #Context-aware Synthesis

Downloads 65.03k

Release Time: 3/6/2025

Model Overview

A conversational speech model that pairs a Llama backbone with a lightweight audio decoder to generate Mimi audio codes. It is suited to text-to-speech tasks.

Model Features

Context-aware Generation

Generates more natural conversational speech when given earlier audio segments as context

Multi-timbre Support

The base model can produce a variety of timbres, but reproducing a specific timbre requires fine-tuning

Efficient Architecture

Combines a Llama backbone with a lightweight audio decoder to balance output quality and efficiency

Model Capabilities

Text-to-speech generation

Conversational speech synthesis

Multi-speaker voice generation

Use Cases

Voice Interaction

Virtual Assistant

Generates natural speech responses for dialogue systems

Sesame's interactive voice demo, powered by a fine-tuned variant of CSM, shows fluid conversational interaction

Content Creation

Audio Content Generation

Converts text content into speech

🚀 CSM 1B

CSM (Conversational Speech Model) is a speech generation model from Sesame. It can generate RVQ audio codes from text and audio inputs. The model architecture uses a Llama backbone and a smaller audio decoder to produce Mimi audio codes.
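For readers new to the term, RVQ (residual vector quantization) represents a vector as a short list of codebook indices: each stage picks the nearest entry in its codebook, and the next stage quantizes whatever residual is left. The sketch below is a toy illustration in plain Python. The codebooks and function names are invented for this example and are not Mimi's codebooks or part of the CSM code.

```python
from typing import List, Sequence

Vector = Sequence[float]


def nearest_index(vector: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    def sq_dist(entry: Vector) -> float:
        return sum((v - e) ** 2 for v, e in zip(vector, entry))

    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def rvq_encode(vector: Vector, codebooks: Sequence[Sequence[Vector]]) -> List[int]:
    """Quantize stage by stage; each stage encodes the residual of the previous one."""
    residual = list(vector)
    codes = []
    for codebook in codebooks:
        idx = nearest_index(residual, codebook)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes: Sequence[int], codebooks: Sequence[Sequence[Vector]]) -> List[float]:
    """Reconstruct the vector by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Toy, hand-written codebooks (hypothetical): a coarse stage and a fine stage.
TOY_CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, -0.25]],
]

codes = rvq_encode([1.2, 1.0], TOY_CODEBOOKS)   # one index per stage
approx = rvq_decode(codes, TOY_CODEBOOKS)       # coarse entry + fine correction
```

Adding more stages refines the approximation, which is why a stack of small codebooks can describe audio frames compactly.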

2025/03/13 - We are releasing the 1B CSM variant. Code is available on GitHub: SesameAILabs/csm.

A fine-tuned variant of CSM powers the interactive voice demo shown in our blog post.

A hosted HuggingFace space is also available for testing audio generation.

🚀 Quick Start

📦 Installation

Set up the repo

git clone git@github.com:SesameAILabs/csm.git
cd csm
python3.10 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# You will need access to sesame/csm-1b and meta-llama/Llama-3.2-1B
huggingface-cli login

💻 Usage Examples

Basic Usage

Generate a sentence

from generator import load_csm_1b
import torchaudio

generator = load_csm_1b(device="cuda")

audio = generator.generate(
    text="Hello from Sesame.",
    speaker=0,
    context=[],
    max_audio_length_ms=10_000,
)

torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
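To get a feel for what `max_audio_length_ms` allows, the cap can be converted into approximate frame and sample counts. This is a back-of-the-envelope sketch, not code from the repository: the 12.5 Hz frame rate and 24 kHz sample rate are assumptions taken from Mimi's published specifications, and `audio_budget` is a hypothetical helper name.

```python
# Assumed values (not stated in this document): Mimi's published
# 12.5 Hz frame rate and 24 kHz sample rate. When a generator is loaded,
# prefer generator.sample_rate over the hardcoded constant.
ASSUMED_FRAME_RATE_HZ = 12.5
ASSUMED_SAMPLE_RATE_HZ = 24_000


def audio_budget(
    max_audio_length_ms: int,
    sample_rate_hz: int = ASSUMED_SAMPLE_RATE_HZ,
    frame_rate_hz: float = ASSUMED_FRAME_RATE_HZ,
) -> dict:
    """Convert a millisecond cap into approximate frame and sample counts."""
    seconds = max_audio_length_ms / 1000
    return {
        "seconds": seconds,
        "frames": int(seconds * frame_rate_hz),
        "samples": int(seconds * sample_rate_hz),
    }


budget = audio_budget(10_000)  # the cap used in the example above
```

Under these assumptions, the 10-second cap corresponds to 125 codec frames and 240,000 output samples at most.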

Advanced Usage

CSM sounds best when provided with context. You can prompt or provide context to the model using a Segment for each speaker utterance.

# Reuses `generator` and `torchaudio` from the basic example above.
from generator import Segment

speakers = [0, 1, 0, 0]
transcripts = [
    "Hey how are you doing.",
    "Pretty good, pretty good.",
    "I'm great.",
    "So happy to be speaking to you.",
]
audio_paths = [
    "utterance_0.wav",
    "utterance_1.wav",
    "utterance_2.wav",
    "utterance_3.wav",
]

def load_audio(audio_path):
    # Load a mono audio file and resample it to the generator's sample rate.
    audio_tensor, sample_rate = torchaudio.load(audio_path)
    audio_tensor = torchaudio.functional.resample(
        audio_tensor.squeeze(0), orig_freq=sample_rate, new_freq=generator.sample_rate
    )
    return audio_tensor

segments = [
    Segment(text=transcript, speaker=speaker, audio=load_audio(audio_path))
    for transcript, speaker, audio_path in zip(transcripts, speakers, audio_paths)
]
audio = generator.generate(
    text="Me too, this is some cool stuff huh?",
    speaker=1,
    context=segments,
    max_audio_length_ms=10_000,
)

torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
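In a long conversation the list of context segments keeps growing, and a model with a bounded input length cannot take all of it. One possible approach is to keep only the most recent utterances that fit within a duration budget. The helper below is a sketch of that idea. `keep_recent` and its parameters are hypothetical names, and the right budget for CSM is not stated in this document.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def keep_recent(
    segments: Sequence[T], durations_s: Sequence[float], budget_s: float
) -> List[T]:
    """Keep the newest segments whose total duration fits in `budget_s`.

    Order is preserved, so the result can be passed as context directly.
    """
    if len(segments) != len(durations_s):
        raise ValueError("segments and durations_s must have the same length")

    kept: List[T] = []
    total = 0.0
    # Walk backwards from the latest utterance and stop at the first overflow.
    for segment, duration in zip(reversed(segments), reversed(durations_s)):
        if total + duration > budget_s:
            break
        kept.append(segment)
        total += duration
    kept.reverse()
    return kept


# Stand-in values for illustration: labels instead of real Segment objects.
recent = keep_recent(["u0", "u1", "u2", "u3"], [4.0, 3.0, 2.0, 1.0], budget_s=5.0)
```

With the segments built above, a duration could be estimated as `segment.audio.shape[-1] / generator.sample_rate`, assuming each segment's audio is a one-dimensional tensor as returned by `load_audio`.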

📚 Documentation

FAQ

Does this model come with any voices?

The model open-sourced here is a base generation model. It is capable of producing a variety of voices, but it has not been fine-tuned on any specific voice.

Can I converse with the model?

CSM is trained to be an audio generation model and not a general purpose multimodal LLM. It cannot generate text. We suggest using a separate LLM for text generation.
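A conversational system therefore pairs two components: a text model that writes the reply and CSM to voice it. The sketch below shows one way to wire a single turn together. Both `generate_text` and `synthesize` are hypothetical callables supplied by the caller, and the stubs at the bottom exist only to make the example runnable.

```python
from typing import Any, Callable, List, Tuple

Turn = Tuple[str, str]  # (role, text)


def reply_turn(
    history: List[Turn],
    user_text: str,
    generate_text: Callable[[List[Turn]], str],
    synthesize: Callable[[str], Any],
) -> Tuple[List[Turn], Any]:
    """Run one turn: the text model writes the reply, the speech model voices it."""
    turns = history + [("user", user_text)]
    reply_text = generate_text(turns)   # a separate LLM; CSM cannot produce text
    audio = synthesize(reply_text)      # e.g. a wrapper around generator.generate
    return turns + [("assistant", reply_text)], audio


# Stubs for illustration only; replace them with a real LLM and a CSM wrapper.
def echo_llm(turns: List[Turn]) -> str:
    return "You said: " + turns[-1][1]


def fake_tts(text: str) -> str:
    return "<audio for '" + text + "'>"


history, audio = reply_turn([], "Hello", echo_llm, fake_tts)
```

In practice `synthesize` could wrap `generator.generate(text=..., speaker=..., context=..., max_audio_length_ms=...)` as shown in the usage examples, with earlier turns passed as context segments.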

Does it support other languages?

The model has some capacity for non-English languages due to data contamination in the training data, but it likely won't do well.

Misuse and abuse ⚠️

⚠️ Important Note

This project provides a high-quality speech generation model for research and educational purposes. While we encourage responsible and ethical use, we explicitly prohibit the following:

Impersonation or Fraud: Do not use this model to generate speech that mimics real individuals without their explicit consent.

Misinformation or Deception: Do not use this model to create deceptive or misleading content, such as fake news or fraudulent calls.

Illegal or Harmful Activities: Do not use this model for any illegal, harmful, or malicious purposes.

By using this model, you agree to comply with all applicable laws and ethical guidelines. We are not responsible for any misuse, and we strongly condemn unethical applications of this technology.

Authors

Johan Schalkwyk, Ankit Kumar, Dan Lyth, Sefik Emre Eskimez, Zack Hodari, Cinjon Resnick, Ramon Sanabria, Raven Jiang, and the Sesame team.

📄 License

This project is under the Apache 2.0 license.
