🚀 Run & Fine-tune TTS models with Unsloth!
Unsloth enables you to effortlessly run and fine-tune Text-to-Speech (TTS) models. It provides free notebooks for multiple models and fine-tunes up to 2x faster with up to 70% less memory usage.
📦 Installation
```bash
# Clone the CSM repository and enter it
git clone git@github.com:SesameAILabs/csm.git
cd csm

# Set up a Python 3.10 virtual environment and install dependencies
python3.10 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Log in to Hugging Face to download the model checkpoints
huggingface-cli login
```
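Note: the CSM-1B checkpoint (and the Llama-3.2-1B model it builds on) may be gated on Hugging Face, so you may need to request access to both repositories before the login above will let you download them.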
💻 Usage Examples
🔍 Basic Usage
```python
from generator import load_csm_1b
import torchaudio

# Load the CSM-1B generator onto the GPU
generator = load_csm_1b(device="cuda")

# Generate up to 10 seconds of speech for speaker 0, with no conversational context
audio = generator.generate(
    text="Hello from Sesame.",
    speaker=0,
    context=[],
    max_audio_length_ms=10_000,
)

# The generator returns a mono waveform tensor; add a channel dim before saving
torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
```
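If no GPU is available, the same code can fall back to CPU. A minimal sketch, assuming `load_csm_1b` accepts a plain torch device string (generation on CPU will be noticeably slower):

```python
import torch

# Pick CUDA when available, otherwise fall back to CPU
# (assumes load_csm_1b accepts any torch device string)
device = "cuda" if torch.cuda.is_available() else "cpu"
generator = load_csm_1b(device=device)
```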
🔍 Advanced Usage
```python
from generator import Segment  # Segment holds one utterance of conversational context

# A short two-speaker conversation used as context
speakers = [0, 1, 0, 0]
transcripts = [
    "Hey how are you doing.",
    "Pretty good, pretty good.",
    "I'm great.",
    "So happy to be speaking to you.",
]
audio_paths = [
    "utterance_0.wav",
    "utterance_1.wav",
    "utterance_2.wav",
    "utterance_3.wav",
]

def load_audio(audio_path):
    # Load a mono utterance and resample it to the generator's sample rate
    audio_tensor, sample_rate = torchaudio.load(audio_path)
    audio_tensor = torchaudio.functional.resample(
        audio_tensor.squeeze(0), orig_freq=sample_rate, new_freq=generator.sample_rate
    )
    return audio_tensor

# Pair each transcript with its speaker ID and audio
segments = [
    Segment(text=transcript, speaker=speaker, audio=load_audio(audio_path))
    for transcript, speaker, audio_path in zip(transcripts, speakers, audio_paths)
]

# Generate a reply from speaker 1, conditioned on the conversation so far
audio = generator.generate(
    text="Me too, this is some cool stuff huh?",
    speaker=1,
    context=segments,
    max_audio_length_ms=10_000,
)
torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
```
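Passing `context` Segments conditions generation on the preceding conversation, which helps the model keep the target speaker's voice and the dialogue's prosody consistent; in general, CSM sounds best when it is given context like this rather than called with an empty list.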
✨ Features
- Multiple Model Support: Supports fine-tuning of various TTS and speech models, including Sesame-CSM-1B, Whisper Large V3, Qwen3 (14B), and Llama 3.2 Vision (11B).
- High Performance: Fine-tunes up to 2x faster with up to 70% less memory usage, while Unsloth Dynamic 2.0 quants deliver strong accuracy.
- Free Notebooks: Provides free Google Colab notebooks for easy fine-tuning.
📚 Documentation
- Model Collection: See our collection for all our TTS model uploads.
- Fine-Tuning Guide: Learn to fine-tune TTS models - Read our Guide. For a rough idea of what a setup looks like, see the sketch after this list.
- Unsloth Dynamic 2.0: Unsloth Dynamic 2.0 achieves superior accuracy and outperforms other leading quants.
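The sketch below is a minimal, hypothetical illustration of an Unsloth fine-tuning setup: the `unsloth/csm-1b` model name, the `FastModel` arguments, and the LoRA settings are assumptions, so follow the Guide above for the actual, current recipe.

```python
from unsloth import FastModel

# Hypothetical sketch, not the official recipe: model name and arguments
# are assumptions based on Unsloth's usual loading API.
model, processor = FastModel.from_pretrained(
    "unsloth/csm-1b",    # assumed Unsloth upload of Sesame CSM-1B
    max_seq_length=2048,
    load_in_4bit=False,  # TTS models are often fine-tuned in 16-bit
)

# Attach LoRA adapters so only a small set of weights is trained
model = FastModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
```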
📄 License
This project is licensed under the Apache-2.0 license.
🔧 Technical Details
CSM (Conversational Speech Model) is a speech generation model from Sesame. It generates RVQ (residual vector quantization) audio codes from text and audio inputs. The architecture pairs a Llama backbone with a smaller audio decoder that produces Mimi audio codes.
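As a toy illustration of what residual vector quantization means (this is not CSM's actual Mimi codec, just the general idea): each quantizer stage picks the nearest codebook entry for the current residual, and later stages quantize what earlier stages missed.

```python
import torch

torch.manual_seed(0)

# Toy residual vector quantizer: 3 stages, each an 8-entry codebook of
# 4-dim vectors. Real codecs like Mimi learn these; here they are random.
codebooks = [torch.randn(8, 4) for _ in range(3)]

def rvq_encode(x, codebooks):
    """Return one code index per stage; each stage quantizes the residual
    left over by the previous stages."""
    codes, residual = [], x.clone()
    for cb in codebooks:
        idx = torch.cdist(residual.unsqueeze(0), cb).argmin()  # nearest entry
        codes.append(idx.item())
        residual = residual - cb[idx]  # later stages refine this leftover
    return codes

def rvq_decode(codes, codebooks):
    # Reconstruction is the sum of the chosen entries from every stage
    return sum(cb[i] for cb, i in zip(codebooks, codes))

x = torch.randn(4)
codes = rvq_encode(x, codebooks)
print(codes, (x - rvq_decode(codes, codebooks)).norm())  # residual error after 3 stages
```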
A fine-tuned variant of CSM powers the interactive voice demo shown in our blog post, and a hosted Hugging Face Space is available for testing audio generation.
❓ FAQ
- Does this model come with any voices? The model open-sourced here is a base generation model. It can produce a variety of voices, but it has not been fine-tuned on any specific voice.
- Can I converse with the model? CSM is trained as an audio generation model, not a general-purpose multimodal LLM. It cannot generate text; we suggest using a separate LLM for text generation.
- Does it support other languages? The model has some capacity for non-English languages due to data contamination in the training data, but it likely won't perform well.
⚠️ Misuse and abuse
This project provides a high-quality speech generation model for research and educational purposes. While we encourage responsible and ethical use, we explicitly prohibit the following:
- Impersonation or Fraud: Do not use this model to generate speech that mimics real individuals without their explicit consent.
- Misinformation or Deception: Do not use this model to create deceptive or misleading content, such as fake news or fraudulent calls.
- Illegal or Harmful Activities: Do not use this model for any illegal, harmful, or malicious purposes.
By using this model, you agree to comply with all applicable laws and ethical guidelines. We are not responsible for any misuse, and we strongly condemn unethical applications of this technology.
👥 Authors
Johan Schalkwyk, Ankit Kumar, Dan Lyth, Sefik Emre Eskimez, Zack Hodari, Cinjon Resnick, Ramon Sanabria, Raven Jiang, and the Sesame team.
📊 Model Performance Table
| Property | Details |
|----------|---------|
| Model Type | CSM (Conversational Speech Model) |
| Training Data | Not specified |
| Supported Models | Sesame-CSM-1B, Whisper Large V3, Qwen3 (14B), Llama 3.2 Vision (11B) |
| Performance Improvement | Up to 2x faster, up to 70% less memory usage |
🔗 Useful Links