CSM-Expressiva-1B Open-Source Emotional Speech Model - Free Whisper-Style Speech Synthesis

CSM-Expressiva-1B

Developed by senstella

An emotional speech model fine-tuned from the CSM-1b conversational speech model, supporting whisper-style speech synthesis

Speech Synthesis English #Whisper-style TTS #LoRA fine-tuning optimization #Lightweight training

Downloads 105

Release Time: 4/10/2025

Model Overview

This model is a supervised fine-tune (SFT) of the CSM base model on whisper-style speech data from the Expresso dataset. It was built to check that LoRA fine-tuning in the csm-mlx codebase works, and it can generate speech with specific emotional characteristics.

Model Features

Whisper-style speech synthesis

Capable of generating emotional speech with specific whisper-style characteristics

LoRA fine-tuning optimization

Uses Low-Rank Adaptation (LoRA) technology for efficient fine-tuning, adding new features while preserving the base model's capabilities

Lightweight training

Was trained on a MacBook Air M2 with 16 GB of memory (with heavy swap usage), so it is suitable for resource-limited environments

Improved stability

Fine-tuning makes typical base model failures (such as never-ending silence) much less frequent, although they still occur occasionally

Model Capabilities

Text-to-Speech

Emotional speech synthesis

Whisper-style generation

Use Cases

Speech synthesis

Emotional voice assistant

Adds whisper and other emotional speech output capabilities to voice assistants

Capable of generating natural emotional speech

Audio content creation

Provides diverse speech styles for audiobooks, podcasts, and other content creation

Can generate speech content with specific styles

🚀 csm-expressiva

An experimental SFT fine-tune of CSM (Conversational Speech Model) on Expresso's 4th whispering voice. It is a quick spin-off to see whether SFT LoRA tuning in the csm-mlx repository works well.

🚀 Quick Start

The model was trained on a MacBook Air M2 16GB with heavy swap usage, and training took 0:43:47. The repository contains two styles of checkpoint: ckpt.pt and ckpt.safetensors are for the original PyTorch-based CSM implementations, while mlx-ckpt.safetensors is for the csm-mlx repository.

⚠️ Important Note

Please use speaker_id 4 for inference, since that is the speaker the model was trained with!

For the original PyTorch-based CSM implementations, changing only the repository name should work, because all filenames are identical to the base model's. For csm-mlx, the checkpoint is named mlx-ckpt.safetensors instead of ckpt.safetensors, so load mlx-ckpt.safetensors explicitly.
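
The file-naming rule above can be captured in a small lookup. This is only an illustrative sketch: `checkpoint_filename` and the backend labels `"pytorch"` and `"mlx"` are hypothetical names, not part of csm-mlx or the original CSM code. Only the filenames themselves come from the repository description.

```python
# Hypothetical helper (not part of any library): maps an inference backend
# to the checkpoint file described above. Backend labels are assumptions.
CHECKPOINTS = {
    "pytorch": "ckpt.safetensors",   # ckpt.pt is the other PyTorch-style file
    "mlx": "mlx-ckpt.safetensors",   # csm-mlx needs this differently named file
}


def checkpoint_filename(backend: str) -> str:
    """Return the checkpoint filename to request for the given backend."""
    try:
        return CHECKPOINTS[backend.lower()]
    except KeyError:
        raise ValueError(f"unknown backend: {backend!r}") from None
```

The returned name could then be passed as the `filename` argument of `hf_hub_download`, as in the usage example.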

💻 Usage Examples

Basic Usage

from mlx_lm.sample_utils import make_sampler
from huggingface_hub import hf_hub_download
from csm_mlx import CSM, csm_1b, generate

import audiofile
import numpy as np

csm = CSM(csm_1b())
weight = hf_hub_download(repo_id="senstella/csm-expressiva-1b", filename="mlx-ckpt.safetensors") # Here's the difference!
csm.load_weights(weight)

audio = generate(
    csm,
    text="Hello from Sesame.",
    speaker=4, # And this is another difference - please use 4 regardless of where you're inferencing!
    context=[],
    max_audio_length_ms=20_000,
    sampler=make_sampler(temp=0.8, top_k=50)
)

audiofile.write("./audio.wav", np.asarray(audio), 24000)
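
If `audiofile` is not available, the waveform could also be saved with Python's standard `wave` module. The sketch below is an alternative, not what the author used. It assumes the generated audio is mono and holds float samples in the range [-1.0, 1.0]; that range is an assumption, so check it against your own output. `write_wav_pcm16` and `max_samples` are hypothetical helper names.

```python
import struct
import wave

SAMPLE_RATE = 24_000  # same rate that is passed to audiofile.write above


def write_wav_pcm16(path, samples, sample_rate=SAMPLE_RATE):
    """Write mono float samples (assumed in [-1.0, 1.0]) as 16-bit PCM."""
    frames = bytearray()
    for s in samples:
        clipped = max(-1.0, min(1.0, float(s)))  # guard against overshoot
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))


def max_samples(max_audio_length_ms, sample_rate=SAMPLE_RATE):
    """Upper bound on the sample count for a given length limit."""
    return max_audio_length_ms * sample_rate // 1000
```

With the settings from the example, `max_samples(20_000)` gives 480000, so a call such as `write_wav_pcm16("./audio.wav", np.asarray(audio).tolist())` would write at most that many frames.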

🔧 Technical Details

Observations

SFT on a small dataset somewhat mitigates the CSM base model's failure cases (never-ending silence, etc.). The model still fails sometimes, but much less often than before SFT.
A small SFT run can easily copy the voice in fine detail.
The model seems much more stable when quantized! (This was reported in this PR first!)

Hyperparameters

Property	Details
batch_size	1
epoch	1
first_codebook_weight_multiplier	1.1
learning-rate	1e-4
weight-decay	1e-4
optimizer	adamw
lora-rank	8
lora-alpha	16
target-modules	attn, codebook0_head, projection
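
To give a feel for what `lora-rank` 8 and `lora-alpha` 16 mean, here is a minimal sketch using the standard LoRA convention, in which a frozen weight of shape `d_out x d_in` gets a low-rank update `B @ A` scaled by `alpha / rank`. Whether csm-mlx applies exactly this scaling is an assumption, and the 2048 x 2048 layer size is a made-up illustration, not a figure from the model.

```python
def lora_scale(alpha: float, rank: int) -> float:
    """Scaling applied to the low-rank update in the usual LoRA formulation."""
    return alpha / rank


def lora_extra_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by LoRA: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


RANK, ALPHA = 8, 16      # values from the hyperparameter table above
D_IN = D_OUT = 2048      # hypothetical layer size, for illustration only

scale = lora_scale(ALPHA, RANK)
extra = lora_extra_params(D_IN, D_OUT, RANK)
full = D_IN * D_OUT

print(f"scale={scale}, adapter params={extra}, full matrix params={full}")
print(f"adapter is {100 * extra / full:.2f}% of the full matrix")
```

Under these assumptions the adapter trains well under 1% of the parameters of each adapted matrix, which is consistent with the run fitting on a 16GB laptop.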

The future plan is to implement KTO on csm-mlx and further mitigate model failure cases using that approach.

⚠️ Important Note

This model was fine-tuned to investigate whether the CSM-1b model exhibits emergent capacity to effectively compress and reconstruct whisper-style vocal features, something that traditional TTS models do not usually demonstrate. It also serves as a preliminary verification of the csm-mlx training setup and the correctness of its loss function. The author does not endorse or encourage any inappropriate use of this model. Any unintended associations or interpretations do not reflect the intent behind this model.

📄 License

The license follows the Expresso dataset's cc-by-nc-4.0, since the model was trained on it!
