CSM-1B Open-Source Speech Generation Model - Free Generation of Audio Encoding from Text and Audio Inputs

CSM 1B

Developed by chutesai

CSM (Conversational Speech Model) is a 1-billion-parameter speech generation model developed by Sesame. It generates RVQ (residual vector quantization) audio codes from text and audio inputs.

Speech Synthesis

Transformers

English | Open Source License: Apache-2.0 | #Multi-speaker speech generation #Context-aware TTS #Llama architecture audio model

Downloads 814

Release Time: 3/18/2025

Model Overview

CSM is a speech generation model built on a Llama backbone and a lightweight audio decoder. It generates Mimi audio codes from text and audio inputs, which makes it suitable for text-to-speech tasks.
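To make the idea of RVQ codes concrete: in residual vector quantization, each audio frame is quantized by a sequence of codebooks, and each codebook encodes the residual left over by the previous one, so a frame becomes one integer code per codebook. The sketch below is a toy illustration in plain Python. The codebooks, vectors, and function names are made up for the example and have nothing to do with Mimi's actual codec weights.

```python
# Toy residual vector quantization (RVQ). Illustrative only: the codebooks
# and helper names here are assumptions, not part of CSM or Mimi.
from typing import List, Sequence

Vector = Sequence[float]


def nearest(codebook: Sequence[Vector], x: Vector) -> int:
    """Return the index of the codebook entry closest to x (squared distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], x)),
    )


def rvq_encode(codebooks: Sequence[Sequence[Vector]], x: Vector) -> List[int]:
    """Quantize x stage by stage; each stage encodes the remaining residual."""
    residual = list(x)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codebooks: Sequence[Sequence[Vector]], codes: Sequence[int]) -> List[float]:
    """Reconstruct the vector by summing the selected entry from each stage."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, codes):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


codebooks = [
    [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0)],    # coarse stage
    [(0.0, 0.0), (0.25, 0.0), (0.0, -0.25)],  # fine stage refines the residual
]
codes = rvq_encode(codebooks, (1.2, 1.0))
approx = rvq_decode(codebooks, codes)
print(codes, approx)
```

Each extra codebook adds detail on top of the coarser stages, which is why a frame is represented as a small stack of codes rather than a single index.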

Model Features

Multi-voice generation

The base generation model can produce a variety of voices, and contextual prompts can be used to steer how a voice sounds.

Context-aware

Providing conversational context (text + audio) can significantly improve generation quality.

Efficient architecture

Built on a Llama backbone and a lightweight audio decoder, balancing quality and efficiency.

Model Capabilities

Text-to-speech

Multi-voice speech generation

Context-aware speech synthesis

Use Cases

Voice interaction

Conversational voice assistant

Convert LLM-generated text into natural-sounding speech

Achieve more natural voice interaction experiences

Content creation

Audio content generation

Automatically convert text content into speech

Efficiently generate audiobooks, podcasts, and other audio content

🚀 CSM 1B (Safetensors)

CSM 1B (Safetensors) is a speech generation model that can generate RVQ audio codes from text and audio inputs. It offers a high-quality solution for speech generation and is suitable for research and educational use.

🚀 Quick Start

Set Up the Repo

git clone git@github.com:SesameAILabs/csm.git
cd csm
python3.10 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Generate a Sentence

from generator import load_csm_1b
import torchaudio

generator = load_csm_1b(device="cuda")
audio = generator.generate(
    text="Hello from Sesame.",
    speaker=0,
    context=[],
    max_audio_length_ms=10_000,
)

torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
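The `max_audio_length_ms` argument caps how much audio is generated. As a rough guide to what that cap means, the sketch below converts it into a frame count and a sample count. It assumes a 24 kHz output rate and 80 ms per codec frame (12.5 frames per second), which matches the published Mimi codec settings but is not stated in this model card. The helper name `audio_budget` is hypothetical. In real code, read `generator.sample_rate` rather than hard-coding the rate.

```python
# Rough size of a generation budget. The two constants are assumptions
# (Mimi's published 24 kHz / 12.5 Hz settings), not values from this card.
from typing import Tuple

ASSUMED_SAMPLE_RATE_HZ = 24_000
ASSUMED_FRAME_MS = 80


def audio_budget(
    max_audio_length_ms: int,
    sample_rate_hz: int = ASSUMED_SAMPLE_RATE_HZ,
    frame_ms: int = ASSUMED_FRAME_MS,
) -> Tuple[int, int]:
    """Return (max codec frames, max waveform samples) for a time budget."""
    max_frames = max_audio_length_ms // frame_ms
    max_samples = max_audio_length_ms * sample_rate_hz // 1000
    return max_frames, max_samples


frames, samples = audio_budget(10_000)
print(frames, samples)
```

Under these assumptions, the 10-second budget used above allows at most 125 frames, or 240,000 samples.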

Generate with Context

# Reuses `generator` and `torchaudio` from the previous snippet.
from generator import Segment

speakers = [0, 1, 0, 0]
transcripts = [
    "Hey how are you doing.",
    "Pretty good, pretty good.",
    "I'm great.",
    "So happy to be speaking to you.",
]
audio_paths = [
    "utterance_0.wav",
    "utterance_1.wav",
    "utterance_2.wav",
    "utterance_3.wav",
]

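def load_audio(audio_path):
    # Load the clip; squeeze(0) below assumes a single-channel (mono) file.
    audio_tensor, sample_rate = torchaudio.load(audio_path)
    # Resample to the rate the generator expects.
    audio_tensor = torchaudio.functional.resample(
        audio_tensor.squeeze(0), orig_freq=sample_rate, new_freq=generator.sample_rate
    )
    return audio_tensor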

segments = [
    Segment(text=transcript, speaker=speaker, audio=load_audio(audio_path))
    for transcript, speaker, audio_path in zip(transcripts, speakers, audio_paths)
]
audio = generator.generate(
    text="Me too, this is some cool stuff huh?",
    speaker=1,
    context=segments,
    max_audio_length_ms=10_000,
)

torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
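One pitfall in the context example: `zip` silently stops at the shortest input, so a missing transcript or audio path would drop turns from the context without any error. A small guard like the one below can catch that before any audio is loaded. The helper name `pair_turns` is hypothetical and not part of the CSM repository.

```python
# Guard against mismatched context lists before building segments.
# `pair_turns` is an illustrative helper, not part of the CSM codebase.
from typing import List, Sequence, Tuple


def pair_turns(
    transcripts: Sequence[str],
    speakers: Sequence[int],
    audio_paths: Sequence[str],
) -> List[Tuple[str, int, str]]:
    """Return (transcript, speaker, audio_path) triples, or raise on mismatch."""
    lengths = (len(transcripts), len(speakers), len(audio_paths))
    if len(set(lengths)) != 1:
        raise ValueError(
            f"context lists differ in length: transcripts={lengths[0]}, "
            f"speakers={lengths[1]}, audio_paths={lengths[2]}"
        )
    return list(zip(transcripts, speakers, audio_paths))


turns = pair_turns(
    ["Hey how are you doing.", "Pretty good, pretty good."],
    [0, 1],
    ["utterance_0.wav", "utterance_1.wav"],
)
print(turns)
```

The returned triples can be iterated in place of the bare `zip(...)` call when constructing the `Segment` list.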

✨ Features

Format Conversion: Converted from the original version to the Safetensors FP16 format, with an updated config and code pointing to ungated Llama. It also tracks downloads.
Model Architecture: The model architecture employs a Llama backbone and a smaller audio decoder that produces Mimi audio codes.
Fine-Tuned Variant: A fine-tuned variant of CSM powers the interactive voice demo shown in the blog post.
Hosted Space: A hosted HuggingFace space is available for testing audio generation.
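As a back-of-the-envelope illustration of what the FP16 conversion means for storage, the sketch below estimates weight size from parameter count and bytes per parameter. It uses a nominal 1,000,000,000 parameters as an assumption. The exact parameter count is not given here, and real checkpoints also carry some metadata overhead.

```python
# Estimate weight storage for a nominal "1B" model. The parameter count is
# an assumed round number, not the exact size of CSM 1B.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}


def weight_bytes(num_params: int, dtype: str) -> int:
    """Raw bytes needed to store num_params weights in the given dtype."""
    return num_params * BYTES_PER_PARAM[dtype]


params = 1_000_000_000
fp16_gib = weight_bytes(params, "fp16") / 2**30
fp32_gib = weight_bytes(params, "fp32") / 2**30
print(f"fp16: {fp16_gib:.2f} GiB, fp32: {fp32_gib:.2f} GiB")
```

Halving the bytes per parameter halves the download and the memory needed to hold the weights, independent of the file container used.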

📚 Documentation

Model Introduction

This is a Safetensors FP16 conversion of the original version, with an updated config and code pointing to ungated Llama.

2025/03/13 - The 1B CSM variant is released. Code is available on GitHub: SesameAILabs/csm.

CSM (Conversational Speech Model) is a speech generation model from Sesame that generates RVQ audio codes from text and audio inputs.

FAQ

Does this model come with any voices?

The model open-sourced here is a base generation model. It is capable of producing a variety of voices, but it has not been fine-tuned on any specific voice.

Can I converse with the model?

CSM is trained to be an audio generation model and not a general purpose multimodal LLM. It cannot generate text. We suggest using a separate LLM for text generation.
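A conversational loop along these lines could pair a text LLM with the speech model: the LLM writes the reply, the speech model voices it, and both turns are appended to the running context. The sketch below shows only the control flow. `Turn`, `reply`, `fake_llm`, and `fake_tts` are hypothetical stand-ins so the example runs without a GPU or model weights.

```python
# Control-flow sketch for pairing a text LLM with a speech model.
# All names here are illustrative stand-ins, not CSM or LLM APIs.
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Turn:
    """Stand-in for a context segment: text, speaker id, and audio."""
    text: str
    speaker: int
    audio: Any


def reply(
    user_turn: Turn,
    history: List[Turn],
    write_text: Callable[[List[Turn]], str],
    speak: Callable[[str, int, List[Turn]], Any],
    bot_speaker: int = 1,
) -> Turn:
    """Produce one spoken reply and record both turns in the history."""
    context = history + [user_turn]
    text = write_text(context)                 # a separate LLM writes the words
    audio = speak(text, bot_speaker, context)  # the speech model voices them
    bot_turn = Turn(text, bot_speaker, audio)
    history.extend([user_turn, bot_turn])
    return bot_turn


def fake_llm(context: List[Turn]) -> str:
    return f"You said: {context[-1].text}"


def fake_tts(text: str, speaker: int, context: List[Turn]) -> List[float]:
    return [0.0] * len(text)  # placeholder "audio"


history: List[Turn] = []
bot = reply(Turn("Hello", 0, [0.0]), history, fake_llm, fake_tts)
print(bot.text, len(history))
```

With CSM, `speak` could wrap `generator.generate(text=..., speaker=..., context=..., max_audio_length_ms=...)` after converting each turn into a `Segment`, as in the Quick Start examples.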

Does it support other languages?

The model has some capacity for non-English languages due to data contamination in the training data, but it likely won't do well.

Misuse and abuse ⚠️

This project provides a high-quality speech generation model for research and educational purposes. While we encourage responsible and ethical use, we explicitly prohibit the following:

Impersonation or Fraud: Do not use this model to generate speech that mimics real individuals without their explicit consent.
Misinformation or Deception: Do not use this model to create deceptive or misleading content, such as fake news or fraudulent calls.
Illegal or Harmful Activities: Do not use this model for any illegal, harmful, or malicious purposes.

By using this model, you agree to comply with all applicable laws and ethical guidelines. We are not responsible for any misuse, and we strongly condemn unethical applications of this technology.

Authors

Johan Schalkwyk, Ankit Kumar, Dan Lyth, Sefik Emre Eskimez, Zack Hodari, Cinjon Resnick, Ramon Sanabria, Raven Jiang, and the Sesame team.

📄 License

This project is licensed under the Apache 2.0 license.
