csm-1b open-source speech generation model - text and audio input, context-aware audio code generation

Csm 1b

Developed by eustlb

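CSM is a 1B-parameter speech generation model developed by Sesame. It generates RVQ audio codes from text and audio inputs and supports context-aware speech generation.

Text-to-Speech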

Safetensors

English

Open Source License: Apache-2.0

#conversational-speech-generation #multi-speaker-support #context-aware

Downloads 5,144

Release Time : 3/26/2025

Model Overview

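A speech generation model built on a Llama backbone and a lightweight audio decoder. It outputs Mimi audio codes and is suited to text-to-speech tasks.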

Model Features

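Context-aware generation

Accepts earlier conversation audio and text as context, which improves the speech generated for the current turn.

Efficient architecture

Combines a Llama backbone with a lightweight decoder to balance generation quality against compute cost.

Multimodal input

Processes text and audio inputs together for more natural voice interaction.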

Model Capabilities

Text-to-speech generation

Context-aware speech synthesis

Multi-speaker speech generation

Use Cases

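Interactive voice applications

Voice assistants

Provides natural speech output for dialogue systems.

Demos show that it can generate speech with emotional intonation.

Content creation

Audio content generation

Automatically converts text content into speech.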

🚀 CSM 1B

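CSM (Conversational Speech Model) is a speech generation model from Sesame that generates RVQ audio codes from text and audio inputs. The model uses a Llama backbone together with a smaller audio decoder that produces Mimi audio codes.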
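To make the term "RVQ audio codes" concrete, here is a minimal pure-Python sketch of residual vector quantization: each stage picks the nearest entry from its own codebook and passes the leftover residual to the next stage, so one vector becomes a short list of integer codes. The codebooks, the two-dimensional vectors, and the function names are made up for illustration; they are not Mimi's codebooks and this is not how CSM itself is implemented.

```python
# Toy residual vector quantization (RVQ).
# Codebooks and vectors are invented for illustration only.

def nearest(codebook, vector):
    """Return the index of the codebook entry closest to `vector`."""
    def sq_dist(entry):
        return sum((a - b) ** 2 for a, b in zip(entry, vector))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))

def rvq_encode(vector, codebooks):
    """Quantize stage by stage; each stage encodes the previous residual."""
    residual = list(vector)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes

def rvq_decode(codes, codebooks):
    """Sum the selected entry of every stage to rebuild an approximation."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]],     # coarse stage
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],  # fine stage, refines the residual
]
codes = rvq_encode([1.2, 0.9], codebooks)
approx = rvq_decode(codes, codebooks)
print(codes, approx)
```

Each extra stage refines the approximation, which is why a frame of audio is described by several codes rather than one.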

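🚀 Quick Start

2025/03/13 - We released the 1B variant of CSM. The original code is available on GitHub: SesameAILabs/csm.
2025/05/07 - CSM is supported in Transformers.

A fine-tuned variant of CSM powers the interactive voice demo shown in our blog post. A hosted Hugging Face Space is also available for testing audio generation.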

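✨ Key Features

Generates RVQ audio codes from text and audio inputs.
Uses a Llama backbone and a small audio decoder.
Supports batched inference and full-graph compilation with CUDA graphs.
Can be fine-tuned with the Transformers Trainer.

📦 Installation

The source gives no installation steps. The examples below import torch, transformers, and datasets, so those packages need to be installed.

💻 Usage Examples

Basic Usage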

import torch
from transformers import CsmForConditionalGeneration, AutoProcessor

model_id = "eustlb/csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"

# load the model and the processor
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

# prepare the inputs
text = "[0]The past is just a story we tell ourselves." # `[0]` for speaker id 0
inputs = processor(text, add_special_tokens=True).to(device)

# another equivalent way to prepare the inputs
conversation = [
    {"role": "0", "content": [{"type": "text", "text": "The past is just a story we tell ourselves."}]},
]
inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(device)

# infer the model
audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, "example_without_context.wav")
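The basic example prepares the same prompt in two ways: as a string prefixed with the speaker id in square brackets, and as a one-turn conversation whose role is the speaker id. The sketch below builds both forms in plain Python so the correspondence is easy to see. The helper names `as_prefixed_text` and `as_conversation` are hypothetical and are not part of Transformers.

```python
def as_prefixed_text(speaker_id, text):
    """Plain-string form: speaker id in square brackets, then the text."""
    return f"[{speaker_id}]{text}"

def as_conversation(speaker_id, text):
    """Chat-template form: the speaker id becomes the role string."""
    return [{"role": str(speaker_id), "content": [{"type": "text", "text": text}]}]

sentence = "The past is just a story we tell ourselves."
prefixed = as_prefixed_text(0, sentence)
conversation = as_conversation(0, sentence)
print(prefixed)
print(conversation)
```

Either result can then be handed to the processor as in the example above: the string to `processor(...)`, the list to `processor.apply_chat_template(...)`.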

Advanced Usage

Usage with context

import torch
from transformers import CsmForConditionalGeneration, AutoProcessor
from datasets import load_dataset, Audio

model_id = "eustlb/csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"

# load the model and the processor
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

# prepare the inputs
ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")
# ensure the audio is 24kHz
ds = ds.cast_column("audio", Audio(sampling_rate=24000))
conversation = []

# 1. context
for text, audio, speaker_id in zip(ds[:4]["text"], ds[:4]["audio"], ds[:4]["speaker_id"]):
    conversation.append(
        {
            "role": f"{speaker_id}",
            "content": [{"type": "text", "text": text}, {"type": "audio", "path": audio["array"]}],
        }
    )

# 2. text prompt
conversation.append({"role": f"{ds[4]['speaker_id']}", "content": [{"type": "text", "text": ds[4]["text"]}]})

inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(device)

# infer the model
audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, "example_with_context.wav")

Batch inference

import torch
from transformers import CsmForConditionalGeneration, AutoProcessor
from datasets import load_dataset, Audio

model_id = "eustlb/csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"

# load the model and the processor
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

# prepare the inputs 
ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")
# ensure the audio is 24kHz
ds = ds.cast_column("audio", Audio(sampling_rate=24000))
# here a batch with two prompts
conversation = [
    [
        {
            "role": f"{ds[0]['speaker_id']}",
            "content": [
                {"type": "text", "text": ds[0]["text"]},
                {"type": "audio", "path": ds[0]["audio"]["array"]},
            ],
        },
        {
            "role": f"{ds[1]['speaker_id']}",
            "content": [
                {"type": "text", "text": ds[1]["text"]},
            ],
        },
    ],
    [
        {
            "role": f"{ds[0]['speaker_id']}",
            "content": [
                {"type": "text", "text": ds[0]["text"]},
            ],
        }
    ],
]
inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(device)

audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, [f"speech_batch_idx_{i}.wav" for i in range(len(audio))])
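In the inference examples above, every conversation ends with a turn that contains only text, which serves as the prompt to be spoken; earlier turns pair text with audio as context. The sketch below checks a batch for that pattern before it is passed to the processor. This is an assumption drawn from the examples, not a documented requirement, and `is_text_only` and `check_batch` are hypothetical helper names.

```python
def is_text_only(turn):
    """True when every content item in the turn has type 'text'."""
    return all(item["type"] == "text" for item in turn["content"])

def check_batch(batch):
    """Return indices of conversations whose last turn is not a text-only prompt."""
    return [
        i for i, conv in enumerate(batch)
        if not conv or not is_text_only(conv[-1])
    ]

# Dummy batch; the short float lists stand in for audio arrays.
batch = [
    [
        {"role": "0", "content": [{"type": "text", "text": "Hi."},
                                  {"type": "audio", "path": [0.0, 0.1]}]},
        {"role": "1", "content": [{"type": "text", "text": "Hello!"}]},
    ],
    [
        # Last turn still carries audio, so there is no text-only prompt.
        {"role": "0", "content": [{"type": "text", "text": "Hi."},
                                  {"type": "audio", "path": [0.0]}]},
    ],
]
bad = check_batch(batch)
print(bad)
```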

Full-graph compilation with CUDA graphs

import torch
import copy
from transformers import CsmForConditionalGeneration, AutoProcessor
from datasets import load_dataset

model_id = "eustlb/csm-1b"
device = "cuda"

# set logs to ensure no recompilation and graph breaks
torch._logging.set_logs(graph_breaks=True, recompiles=True, cudagraphs=True)

# load the model and the processor
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

# use a static cache, which automatically enables torch.compile with fullgraph and reduce-overhead
model.generation_config.max_length = 250 # big enough to avoid recompilation
model.generation_config.max_new_tokens = None # would take precedence over max_length
model.generation_config.cache_implementation = "static"
model.depth_decoder.generation_config.cache_implementation = "static"

# generation kwargs
gen_kwargs = {
    "do_sample": False,
    "depth_decoder_do_sample": False,
    "temperature": 1.0,
    "depth_decoder_temperature": 1.0,
}

# Define a timing context manager
class TimerContext:
    def __init__(self, name="Execution"):
        self.name = name
        self.start_event = None
        self.end_event = None
        
    def __enter__(self):
        # Use CUDA events for more accurate GPU timing
        self.start_event = torch.cuda.Event(enable_timing=True)
        self.end_event = torch.cuda.Event(enable_timing=True)
        self.start_event.record()
        return self

    def __exit__(self, *args):
        self.end_event.record()
        torch.cuda.synchronize()
        elapsed_time = self.start_event.elapsed_time(self.end_event) / 1000.0
        print(f"{self.name} time: {elapsed_time:.4f} seconds")

# prepare the inputs 
ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")

conversation = [
    {
        "role": f"{ds[0]['speaker_id']}",
        "content": [
            {"type": "text", "text": ds[0]["text"]},
            {"type": "audio", "path": ds[0]["audio"]["array"]},
        ],
    },
    {
        "role": f"{ds[1]['speaker_id']}",
        "content": [
            {"type": "text", "text": ds[1]["text"]},
            {"type": "audio", "path": ds[1]["audio"]["array"]},
        ],
    },
    {
        "role": f"{ds[2]['speaker_id']}",
        "content": [
            {"type": "text", "text": ds[2]["text"]},
        ],
    },
]

padded_inputs_1 = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(device)

print("\n" + "="*50)
print("First generation - compiling and recording CUDA graphs...")
with TimerContext("First generation"):
    _ = model.generate(**padded_inputs_1, **gen_kwargs)
print("="*50)

print("\n" + "="*50)
print("Second generation - fast !!!")
with TimerContext("Second generation"):
    _ = model.generate(**padded_inputs_1, **gen_kwargs)
print("="*50)

# now with different inputs
conversation = [
    {
        "role": f"{ds[0]['speaker_id']}",
        "content": [
            {"type": "text", "text": ds[2]["text"]},
            {"type": "audio", "path": ds[2]["audio"]["array"]},
        ],
    },
    {
        "role": f"{ds[1]['speaker_id']}",
        "content": [
            {"type": "text", "text": ds[3]["text"]},
            {"type": "audio", "path": ds[3]["audio"]["array"]},
        ],
    },
    {
        "role": f"{ds[2]['speaker_id']}",
        "content": [
            {"type": "text", "text": ds[4]["text"]},
        ],
    },
]
padded_inputs_2 = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(device)

print("\n" + "="*50)
print("Generation with other inputs!")
with TimerContext("Generation with different inputs"):
    _ = model.generate(**padded_inputs_2, **gen_kwargs)
print("="*50)
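The `TimerContext` above relies on CUDA events, so it only works on a GPU. For a quick check on a machine without CUDA, a wall-clock variant can be sketched with `time.perf_counter`. The class name `WallClockTimer` is hypothetical. On a GPU, wall-clock time can under-report because kernels launch asynchronously, so the CUDA-event version stays the better choice there.

```python
import time

class WallClockTimer:
    """Context manager that measures elapsed wall-clock time on the host."""

    def __init__(self, name="Execution"):
        self.name = name
        self.elapsed = None  # seconds, set when the block exits

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, *args):
        self.elapsed = time.perf_counter() - self._start
        print(f"{self.name} time: {self.elapsed:.4f} seconds")

# Stand-in workload; replace with a model.generate(...) call when running on CPU.
with WallClockTimer("Sleep") as timer:
    time.sleep(0.01)
```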

Fine-tuning and training

from datasets import load_dataset, Audio
from transformers import (
    CsmForConditionalGeneration,
    TrainingArguments,
    CsmProcessor,
    Trainer
)

processor = CsmProcessor.from_pretrained("eustlb/csm-1b")
model = CsmForConditionalGeneration.from_pretrained("eustlb/csm-1b")
model.train()

ds = load_dataset("eustlb/dailytalk-conversations-grouped", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=processor.feature_extractor.sampling_rate))

def data_collator(samples):
    conversations = [] 

    for sample in samples:
        concatenated_audio_array = sample["audio"]["array"]
        # split the concatenated waveform into one segment per utterance
        audio_segments = [concatenated_audio_array[s: e] for s, e in sample["audio_cut_idxs"]]

        conversation = []
        for speaker_id, text, audio in zip(sample["speaker_ids"], sample["texts"], audio_segments):
            conversation.append({
                "role": f"{speaker_id}",
                "content": [
                    {"type": "text", "text": text},
                    {"type": "audio", "audio": audio}
                ]
            })
            
        conversations.append(conversation)

    inputs = processor.apply_chat_template(
        conversations,
        tokenize=True,
        return_dict=True,
        output_labels=True,
    )
    return inputs

training_args = TrainingArguments(
    "test-trainer",
    remove_unused_columns=False,
    gradient_checkpointing=True,
)

trainer = Trainer(
    model, 
    training_args,
    train_dataset=ds,
    data_collator=data_collator,
)

trainer.train()
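The collator above slices one concatenated waveform into per-utterance segments using `audio_cut_idxs`. Judging from the loop `for s, e in sample["audio_cut_idxs"]`, each entry appears to be a (start, end) pair of sample indices. The sketch below shows that slicing on dummy data; the integer list and the helper name `split_by_cuts` are illustrative assumptions, not taken from the dataset.

```python
def split_by_cuts(samples, cut_idxs):
    """Slice a concatenated sequence into segments given (start, end) pairs."""
    return [samples[start:end] for start, end in cut_idxs]

# Ten dummy "audio samples" standing in for a concatenated waveform.
concatenated = list(range(10))
cuts = [(0, 4), (4, 7), (7, 10)]  # assumed layout: end index is exclusive

segments = split_by_cuts(concatenated, cuts)
print(segments)
```

Each segment is then paired with its speaker id and text to form one turn of the conversation, as the collator does.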