csm-1b-hf Open-Source Speech Model - Free Text-to-Speech and Voice Cloning

Csm 1b Hf

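Developed by thomasgauthier

A Hugging Face implementation of Sesame's Conversational Speech Model (CSM), supporting text-to-speech and voice cloning tasks

Speech Synthesis

Transformers

License: Apache-2.0 #voice cloning #multi-codebook audio generation #autoregressive speech synthesis

Downloads: 3,974

Released: 3/26/2025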

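Model Overview

This is a Hugging Face-compatible version of Sesame's CSM 1B model. It is a complete rewrite of the official implementation and integrates with the Hugging Face ecosystem for both inference and training.

Model Features

Hugging Face compatible

Fully rewritten for compatibility with the Hugging Face ecosystem, so it works with the transformers library for inference and training

Two-stage autoregressive architecture

Uses a two-stage design with inter-frame and intra-frame processing to model long-range dependencies

Compute-amortized training

Uses decoder training amortization: codebooks 1-31 are trained on only a subset of frames, which improves training efficiency

Multimodal input support

Supports interleaved text and audio inputs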

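Model Capabilities

Text-to-speech synthesis

Voice cloning

Multi-codebook audio tokenization

Long-range speech modeling

Use Cases

Speech synthesis

Personalized voice assistants

Generate natural-sounding spoken responses for virtual assistants

Can generate speech with the characteristics of a specific speaker

Spoken content creation

Automatically convert text content to speech

Supports high-quality speech output

Voice cloning

Personalized voice cloning

Clone the vocal characteristics of a specific speaker from a small number of samples

The example shows that a speaker's voice can be cloned successfully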

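🚀 CSM-1B-HF

CSM-1B-HF is a Hugging Face implementation of a speech model that converts text to speech.

🚀 Quick Start

CSM-HF is a Hugging Face implementation of Sesame's Conversational Speech Model (CSM). It is a complete rewrite of the PyTorch code provided by Sesame and is fully compatible with the Hugging Face transformers library, from inference to training.

✨ Key Features

Created a CSMModel class.
Replaced the TorchTune models for the backbone and the decoder with LlamaModel from HF transformers.
Added a processor class that prepares inputs for the model.
Added label support and decoder training amortization.
Added generate_frame and generate methods to the model class for generating audio.
Full support for the Hugging Face Trainer.

💻 Usage Examples

Basic Usage

You can use the model to generate audio from text input. The following is a voice cloning example: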

import torch
from modeling_csm import CSMModel
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer
from tokenizers.processors import TemplateProcessing
from moshi.models import loaders
from processor import CSMProcessor
import torchaudio

device = 'cuda'

def load_llama3_tokenizer():
    """
    https://github.com/huggingface/transformers/issues/22794#issuecomment-2092623992
    """
    tokenizer_name = "meta-llama/Llama-3.2-1B"
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    bos = tokenizer.bos_token
    eos = tokenizer.eos_token
    tokenizer._tokenizer.post_processor = TemplateProcessing(
        single=f"{bos}:0 $A:0 {eos}:0",
        pair=f"{bos}:0 $A:0 {eos}:0 {bos}:1 $B:1 {eos}:1",
        special_tokens=[(f"{bos}", tokenizer.bos_token_id), (f"{eos}", tokenizer.eos_token_id)],
    )

    return tokenizer

text_tokenizer = load_llama3_tokenizer()

mimi_weight = hf_hub_download(loaders.DEFAULT_REPO, loaders.MIMI_NAME)
audio_tokenizer = loaders.get_mimi(mimi_weight, device=device)
audio_tokenizer.set_num_codebooks(32)

processor = CSMProcessor(text_tokenizer, audio_tokenizer)


def load_audio(path, target_sr):
    audio, sr = torchaudio.load(path)  # shape: (channels, samples)
    # Downmix to mono; for a mono file this is equivalent to squeeze(0)
    audio = audio.mean(dim=0)
    if sr != target_sr:
        audio = torchaudio.functional.resample(audio, orig_freq=sr, new_freq=target_sr)
    return audio


model = CSMModel.from_pretrained("thomasgauthier/csm-1b-hf", torch_dtype=torch.bfloat16)
model.to(device)


inputs = processor(
    messages=[
        {
            "role": "speaker_0",
            "content": [
                {"type": "text", "text": "<AUDIO_CLIP_TRANSCRIPT>"},
                # This placeholder is required for audio tokenization; it maps to
                # the first element of the `audios` list passed to the processor.
                {"type": "audio"},
            ],
        },
        {
            "role": "speaker_0",
            "content": [
                {"type": "text", "text": "Hello, this is voice cloning speaking"},
                # No audio entry here: the model will generate it.
            ],
        },
    ],
    audios=[load_audio('AUDIO_CLIP_FOR_VOICE_CLONING.wav', audio_tokenizer.sample_rate)],
    return_tensors="pt"
)

with torch.inference_mode():
    # Generate up to 50 new frames
    gen_frames = model.generate(
        input_ids=inputs['input_ids'].to(device),
        attention_mask=inputs['attention_mask'].to(device),
        max_new_frames=50,
        topk=50,
        temperature=1.0,
        use_cache=True,
        stop_on_all_zeros=True,
    )

decoded_audio = audio_tokenizer.decode(gen_frames.permute(0, 2, 1)).squeeze(0).squeeze(0)

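# Scale to 16-bit PCM; clamp first so samples at or above 1.0 do not overflow int16
audio_array = (decoded_audio.float().clamp(-1.0, 1.0) * 32767).to(torch.int16).cpu().numpy()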

# Audio can be played with the following code:
# from IPython.display import Audio
# Audio(audio_array, rate=audio_tokenizer.sample_rate)
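
Outside a notebook, the generated samples can also be written to disk. The following is a minimal sketch using only the Python standard library; `write_wav_int16` is a hypothetical helper name, not part of this repository, and it assumes mono 16-bit PCM samples such as the `audio_array` produced above.

```python
import sys
import wave
from array import array


def write_wav_int16(path, samples, sample_rate):
    """Write mono 16-bit PCM samples (ints in [-32768, 32767]) to a WAV file.

    Hypothetical helper for illustration; not part of the CSM-HF code.
    """
    pcm = array("h", samples)  # "h" = signed 16-bit integers
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)           # mono
        wav_file.setsampwidth(2)           # 2 bytes per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm.tobytes())
```

With the variables from the example above, a call might look like `write_wav_int16("output.wav", audio_array.tolist(), audio_tokenizer.sample_rate)`; `tolist()` converts the NumPy values to plain Python integers.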

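📚 Documentation

Architecture

The model architecture is discussed in ARCHITECTURE.md (written by O1).

Training

Data Format

CSM-HF expects training data in JSONL format, where each line is a JSON object containing one conversation. Each conversation consists of:

messages: an array of message objects, each containing:
- role: the speaker identifier (for example, "speaker_0", "speaker_1")
- content: an array of content objects, each of which is one of:
  - Text: {"type": "text", "text": "message text"}
  - Audio: {"type": "audio", "url": "path/to/audio_file.wav"}
training_mask: a boolean array indicating which messages are used for training (true) and which serve only as context (false)

Example data format: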

{
  "messages": [
    {
      "role": "speaker_0",
      "content": [
        {"type": "text", "text": "We have a chance for a new life here."},
        {"type": "audio", "url": "clips/example_audio.wav"}
      ]
    },
    {
      "role": "speaker_1",
      "content": [
        {"type": "text", "text": "Uncle?"},
        {"type": "audio", "url": "clips/response_audio.wav"}
      ]
    }
  ],
  "training_mask": [false, true]
}
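
Before starting a long training run, it can help to check that every line of the JSONL file follows this layout. The sketch below is a hypothetical validator, not part of the repository. It assumes that `training_mask` holds exactly one boolean per message (as in the example above) and that audio items in training data always carry a `url`.

```python
import json


def validate_conversation(record):
    """Return a list of problems found in one parsed JSONL record (empty if none)."""
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty array"]

    problems = []
    mask = record.get("training_mask")
    # Assumption: one mask entry per message, as in the documented example.
    if not isinstance(mask, list) or len(mask) != len(messages):
        problems.append("'training_mask' must have one entry per message")
    elif not all(isinstance(flag, bool) for flag in mask):
        problems.append("'training_mask' entries must be booleans")

    for index, message in enumerate(messages):
        if not isinstance(message.get("role"), str):
            problems.append(f"message {index}: 'role' must be a string")
        for item in message.get("content", []):
            kind = item.get("type")
            if kind == "text":
                if "text" not in item:
                    problems.append(f"message {index}: text item lacks 'text'")
            elif kind == "audio":
                if "url" not in item:
                    problems.append(f"message {index}: audio item lacks 'url'")
            else:
                problems.append(f"message {index}: unknown type {kind!r}")
    return problems


def validate_jsonl(path):
    """Yield (line_number, problems) for each non-empty line with problems."""
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if line.strip():
                problems = validate_conversation(json.loads(line))
                if problems:
                    yield line_number, problems
```

For example, `list(validate_jsonl("path/to/training_data.jsonl"))` returns an empty list when every record passes these checks.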

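Training Process

The model uses a two-stage autoregressive architecture:

Backbone (inter-frame processing):
- Processes the entire sequence of frames
- Each frame is represented by the combined embedding of all its codebooks
- Handles long-range dependencies between utterances
Decoder (intra-frame processing):
- Processes one frame at a time
- Generates the 32 codebooks in order (1 semantic codebook + 31 acoustic codebooks)
- Each codebook is treated as one token in the sequence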
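
To make the frame and codebook counts concrete, the sketch below works out how much the two stages produce for a given number of frames. The 32 codebooks per frame come from the description above. The 12.5 Hz frame rate and 24 kHz sample rate are assumptions taken from Mimi's published specification and are not stated in this document; `generation_budget` is a hypothetical name.

```python
NUM_CODEBOOKS = 32            # 1 semantic + 31 acoustic, per the description above
MIMI_FRAME_RATE_HZ = 12.5     # assumption: Mimi's published frame rate
MIMI_SAMPLE_RATE_HZ = 24_000  # assumption: Mimi's published sample rate


def generation_budget(num_frames):
    """Summarize what generating `num_frames` frames amounts to."""
    return {
        # One backbone position per frame (inter-frame stage).
        "backbone_steps": num_frames,
        # One token per codebook per frame (intra-frame stage).
        "codebook_tokens": num_frames * NUM_CODEBOOKS,
        # Audio duration covered by the frames, under the assumed frame rate.
        "seconds": num_frames / MIMI_FRAME_RATE_HZ,
        # Decoded waveform length, under the assumed sample rate.
        "samples": int(num_frames * MIMI_SAMPLE_RATE_HZ / MIMI_FRAME_RATE_HZ),
    }
```

Under these assumptions, `generation_budget(50)` (the `max_new_frames=50` setting used in the usage example) corresponds to 1,600 codebook tokens and about 4 seconds of audio.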

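Training uses compute amortization:

The zeroth (semantic) codebook is trained on all frames.
The remaining codebooks (1-31) are trained on only an amortization_ratio fraction of the frames.
This significantly reduces memory usage while maintaining quality.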
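
One way to picture the amortization is as a label mask. The sketch below is an illustration under stated assumptions and not the repository's implementation: it assumes frames are sampled uniformly at random and that skipped positions are marked with -100, the value PyTorch's cross-entropy loss ignores by default. `amortize_labels` is a hypothetical name.

```python
import random

IGNORE_INDEX = -100  # ignored by PyTorch's cross-entropy loss by default


def amortize_labels(codes, amortization_ratio, rng=None):
    """Build training labels from per-frame codebook ids.

    codes: list of frames, each a list of codebook ids (index 0 = semantic).
    Codebook 0 keeps its label on every frame; the other codebooks keep
    their labels only on a random subset of frames.
    """
    rng = rng or random.Random()
    num_frames = len(codes)
    # Keep at least one frame so the acoustic codebooks always get some signal.
    num_kept = max(1, round(num_frames * amortization_ratio)) if num_frames else 0
    kept = set(rng.sample(range(num_frames), num_kept))

    labels = []
    for frame_index, frame in enumerate(codes):
        if frame_index in kept:
            labels.append(list(frame))
        else:
            # Only the semantic codebook contributes to the loss on this frame.
            labels.append([frame[0]] + [IGNORE_INDEX] * (len(frame) - 1))
    return labels
```

With 16 frames and an illustrative ratio of 1/16, only one frame keeps labels for codebooks 1-31, so the decoder's loss is computed on far fewer positions than the semantic codebook's loss.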

To train the model, run the following command:

python train.py \
  --train_file path/to/training_data.jsonl \
  --output_dir ./output \
  --num_train_epochs 3 \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 8 \
  --learning_rate 5e-6
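
With a per-device batch size of 1, gradient accumulation determines how many conversations contribute to each optimizer step. The small sketch below shows the arithmetic; `effective_batch_size` is a hypothetical helper, and the device count is an assumption, since the command above does not specify one.

```python
def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Number of examples that contribute to one optimizer step."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


# Values from the command above, assuming a single device.
print(effective_batch_size(1, 8))
```

On a single device, the settings above therefore update the weights once every 8 conversations.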