Open-source CSM 1B voice model in quantized Safetensors formats. It takes text and audio input and generates RVQ audio codes.

CSM 1B Safetensors Quants

Developed by lunahr

CSM (Conversational Speech Model) is a 1-billion-parameter speech generation model developed by Sesame. It generates RVQ audio codes from text and audio inputs.

Speech Synthesis

Transformers

Language: English

License: Apache-2.0

Tags: #Conversational Speech Generation #Multi-speaker Support #Context-aware Synthesis

Downloads 37

Release Time: 3/15/2025

Model Overview

A speech generation model built on a Llama backbone and a smaller audio decoder. It supports text-to-speech and outputs Mimi audio codes.

Model Features

Multi-speaker Support

Selects which speaker voice is generated via the speaker parameter

Context-aware Generation

Produces better output when given earlier utterances (text plus audio) as context segments

Safetensors Format

Provides the weights in several Safetensors formats; the repository also tracks download statistics

Model Capabilities

Text-to-speech

Multi-speaker Speech Generation

Context-aware Speech Synthesis

Use Cases

Voice Interaction

Dialogue System Voice Output

Combines with LLM to build a complete dialogue system

Interactive voice demos have been showcased on the blog

Content Creation

Audio Content Generation

Automatically generates voice content such as podcasts and audiobooks

🚀 CSM 1B (Safetensors)

CSM 1B (Safetensors) is the original CSM 1B model converted to several Safetensors formats. This repository also tracks downloads. On March 13, 2025, we released the 1B CSM variant. The code is available on GitHub: SesameAILabs/csm.

CSM (Conversational Speech Model) is a speech generation model from Sesame. It generates RVQ audio codes from text and audio inputs. The model architecture uses a Llama backbone and a smaller audio decoder that produces Mimi audio codes.
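To make the term "RVQ audio codes" concrete, here is a toy sketch of residual vector quantization in general. It is not Mimi's codec or CSM's implementation: the codebooks are random, and every name in it (make_codebooks, rvq_encode, rvq_decode) is hypothetical. Each stage quantizes whatever the previous stages left over, so one frame becomes a short list of integer codes, one per codebook.

```python
import random

def nearest(codebook, vec):
    """Index of the codeword closest to vec (squared Euclidean distance)."""
    def dist(cw):
        return sum((a - b) ** 2 for a, b in zip(cw, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes the previous residual."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes

def rvq_decode(codes, codebooks):
    """Reconstruct by summing the selected codeword from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

def make_codebooks(num_stages, size, dim, seed=0):
    """Random toy codebooks (an assumption; real codecs learn them from data)."""
    rng = random.Random(seed)
    books = []
    for stage in range(num_stages):
        scale = 0.5 ** stage  # later stages model finer detail
        # A zero codeword guarantees a stage never makes the residual larger.
        book = [[0.0] * dim]
        book += [[rng.uniform(-scale, scale) for _ in range(dim)]
                 for _ in range(size - 1)]
        books.append(book)
    return books

codebooks = make_codebooks(num_stages=4, size=16, dim=3)
frame = [0.3, -0.7, 0.5]                 # stand-in for one latent audio frame
codes = rvq_encode(frame, codebooks)     # one integer code per codebook
approx = rvq_decode(codes, codebooks)    # approximate reconstruction of frame
```

Decoding with only the first few codes gives a coarser reconstruction, and each additional codebook refines it.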

A fine-tuned variant of CSM powers the interactive voice demo presented in our blog post. There is also a hosted Hugging Face Space available for testing audio generation.

🚀 Quick Start

📦 Installation

Set up the repository with the following steps:

python -m venv .venv
source .venv/bin/activate
curl -s -L https://raw.githubusercontent.com/SesameAILabs/csm/refs/heads/main/requirements.txt | pip install -r /dev/stdin

# You will need access to sesame/csm-1b and meta-llama/Llama-3.2-1B
huggingface-cli login

💻 Usage Examples

🔍 Basic Usage

Generate a simple sentence:

from generator import load_csm_1b
import torchaudio

generator = load_csm_1b(device="cuda")

audio = generator.generate(
    text="Hello from Sesame.",
    speaker=0,
    context=[],
    max_audio_length_ms=10_000,
)

torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
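As its name suggests, max_audio_length_ms caps the output length in milliseconds, while the saved tensor is measured in samples. The sketch below shows the arithmetic between the two. The helper names are hypothetical, and the 24 kHz rate is an assumption for illustration only; in practice, read the rate from generator.sample_rate.

```python
def ms_to_samples(duration_ms: int, sample_rate: int) -> int:
    """Number of audio samples spanning duration_ms at sample_rate Hz."""
    return duration_ms * sample_rate // 1000

def samples_to_seconds(num_samples: int, sample_rate: int) -> float:
    """Duration in seconds of num_samples at sample_rate Hz."""
    return num_samples / sample_rate

# Assumed value for illustration; use generator.sample_rate in real code.
ASSUMED_SAMPLE_RATE = 24_000

max_samples = ms_to_samples(10_000, ASSUMED_SAMPLE_RATE)
print(max_samples)                                           # 240000
print(samples_to_seconds(max_samples, ASSUMED_SAMPLE_RATE))  # 10.0
```

With a real output, samples_to_seconds(audio.shape[-1], generator.sample_rate) would give the length of the generated clip.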

⚙️ Advanced Usage

CSM performs best when provided with context. You can prompt or provide context to the model using a Segment for each speaker utterance:

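# Reuses `generator` and `torchaudio` from the Basic Usage example above.
from generator import Segment

speakers = [0, 1, 0, 0]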
transcripts = [
    "Hey how are you doing.",
    "Pretty good, pretty good.",
    "I'm great.",
    "So happy to be speaking to you.",
]
audio_paths = [
    "utterance_0.wav",
    "utterance_1.wav",
    "utterance_2.wav",
    "utterance_3.wav",
]

def load_audio(audio_path):
    audio_tensor, sample_rate = torchaudio.load(audio_path)
    audio_tensor = torchaudio.functional.resample(
        audio_tensor.squeeze(0), orig_freq=sample_rate, new_freq=generator.sample_rate
    )
    return audio_tensor

segments = [
    Segment(text=transcript, speaker=speaker, audio=load_audio(audio_path))
    for transcript, speaker, audio_path in zip(transcripts, speakers, audio_paths)
]
audio = generator.generate(
    text="Me too, this is some cool stuff huh?",
    speaker=1,
    context=segments,
    max_audio_length_ms=10_000,
)

torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)

📚 Documentation

❓ FAQ

Does this model come with any voices? The model open-sourced here is a base generation model. It can produce a variety of voices, but it has not been fine-tuned on any specific voice.
Can I converse with the model? CSM is trained to be an audio generation model, not a general-purpose multimodal LLM. It cannot generate text. We suggest using a separate LLM for text generation.
Does it support other languages? The model has some capacity for non-English languages due to data contamination in the training data, but it likely won't perform well.
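The suggestion to pair CSM with a separate LLM can be sketched as a single conversational turn: the LLM writes the reply, the speech model voices it, and both turns are kept as context. Everything here is hypothetical scaffolding rather than part of the CSM code: Turn is a stand-in that mirrors the text, speaker, and audio fields of Segment; generate_text and synthesize are callables you would supply; and the max_turns cap is an assumption, not a documented limit.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

@dataclass
class Turn:
    """Hypothetical stand-in mirroring the text/speaker/audio fields of Segment."""
    text: str
    speaker: int
    audio: Any

def respond(
    user_turn: Turn,
    context: List[Turn],
    generate_text: Callable[[List[str]], str],
    synthesize: Callable[[str, int, List[Turn]], Any],
    bot_speaker: int = 1,
    max_turns: int = 8,  # assumed cap so the context does not grow forever
) -> Tuple[Turn, List[Turn]]:
    """Produce one spoken reply and return it with the updated context."""
    history = context + [user_turn]
    reply_text = generate_text([t.text for t in history])      # separate LLM
    reply_audio = synthesize(reply_text, bot_speaker, history)  # speech model
    reply = Turn(text=reply_text, speaker=bot_speaker, audio=reply_audio)
    return reply, (history + [reply])[-max_turns:]

# Stubs so the sketch runs without a model; replace both with real calls.
def fake_llm(lines: List[str]) -> str:
    return f"You said: {lines[-1]}"

def fake_tts(text: str, speaker: int, ctx: List[Turn]) -> List[float]:
    return [0.0] * len(text)

reply, context = respond(Turn("Hello there.", 0, [0.0]), [], fake_llm, fake_tts)
print(reply.text)  # You said: Hello there.
```

In a real setup, synthesize would wrap generator.generate(text=..., speaker=..., context=..., max_audio_length_ms=...) after converting each Turn into a Segment, and generate_text would call whichever LLM you choose.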

⚠️ Misuse and abuse

This project provides a high-quality speech generation model for research and educational purposes. While we encourage responsible and ethical use, we explicitly prohibit the following:

Impersonation or Fraud: Do not use this model to generate speech that mimics real individuals without their explicit consent.
Misinformation or Deception: Do not use this model to create deceptive or misleading content, such as fake news or fraudulent calls.
Illegal or Harmful Activities: Do not use this model for any illegal, harmful, or malicious purposes.

By using this model, you agree to comply with all applicable laws and ethical guidelines. We are not responsible for any misuse, and we strongly condemn unethical applications of this technology.

👥 Authors

Johan Schalkwyk, Ankit Kumar, Dan Lyth, Sefik Emre Eskimez, Zack Hodari, Cinjon Resnick, Ramon Sanabria, Raven Jiang, and the Sesame team.

📄 License

This project is licensed under the Apache 2.0 license.
