🚀 Parler-TTS Mini: Expresso
Parler-TTS Mini: Expresso is a lightweight text-to-speech (TTS) model, a fine-tuned version of Parler-TTS Mini v0.1 on the Expresso dataset. It generates high-quality, natural-sounding speech and offers better control over emotions and more consistent voices than the original model.
It is part of the first release from the Parler-TTS project, which aims to provide the community with TTS training resources and dataset pre-processing code.
🚀 Quick Start
Using Expresso is straightforward. First, install the library from source:
pip install git+https://github.com/huggingface/parler-tts.git
Then, you can use the model with the following inference snippet:
import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer, set_seed
import soundfile as sf
# Run on the GPU when one is available, otherwise fall back to the CPU
device = "cuda:0" if torch.cuda.is_available() else "cpu"
# Load the fine-tuned checkpoint and its tokenizer from the Hugging Face Hub
model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-mini-expresso").to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-expresso")
# The prompt is the text to speak; the description sets the speaker, emotion, and audio quality
prompt = "Why do you make me do these examples? They're *so* generic."
description = "Thomas speaks moderately slowly in a sad tone with emphasis and high quality audio."
input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
# Fix the seed so the sampled generation is reproducible
set_seed(42)
generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
# Move the waveform to the CPU and save it at the model's sampling rate
audio_arr = generation.cpu().numpy().squeeze()
sf.write("parler_tts_out.wav", audio_arr, model.config.sampling_rate)
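As a quick sanity check on the generated waveform, the duration follows from the number of samples divided by the sampling rate. The sketch below is illustrative only: `audio_duration_seconds` is a hypothetical helper, and the 44100 Hz rate and the silent array are stand-ins rather than values read from the model.

```python
import numpy as np


def audio_duration_seconds(audio_arr: np.ndarray, sampling_rate: int) -> float:
    """Return the duration of a single mono waveform in seconds."""
    # Drop singleton batch/channel dimensions, as the snippet above does with squeeze()
    samples = np.asarray(audio_arr).squeeze()
    if samples.ndim != 1:
        raise ValueError("expected a single mono waveform")
    return samples.shape[0] / float(sampling_rate)


# Stand-in for a generated waveform: silence with shape (1, 88200).
# The sampling rate here is an assumption; in practice use model.config.sampling_rate.
dummy_audio = np.zeros((1, 88200), dtype=np.float32)
duration = audio_duration_seconds(dummy_audio, 44100)
print(f"{duration:.2f} s")
```

With real output, pass `audio_arr` and `model.config.sampling_rate` instead of the stand-in values.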
✨ Features
- Superior Control: Offers better control over emotions (happy, confused, laughing, sad) and more consistent voices (Jerry, Thomas, Elisabeth, Talia) than the original model.
- High-Quality Speech: Generates high-quality, natural-sounding speech.
📚 Documentation
Training Procedure
Expresso is a high-quality, expressive speech dataset with samples from four speakers (two male, two female). By fine-tuning Parler-TTS Mini v0.1 on this dataset, we can train the model to follow emotion and speaker prompts.
To reproduce this fine-tuning run, we need to perform two steps:
- Create text descriptions from the audio samples in the Expresso dataset
- Train the model on the (text, audio) pairs
Step 0: Set-Up
Create a fresh Python environment:
python3 -m venv parler-env
source parler-env/bin/activate
Install PyTorch according to the official instructions. Then install DataSpeech and Parler-TTS sequentially:
git clone https://github.com/huggingface/dataspeech.git && cd dataspeech && pip install -r requirements.txt
cd ..
git clone https://github.com/huggingface/parler-tts.git && cd parler-tts && pip install -e ."[train]"
cd ..
Link your Hugging Face account:
git config --global credential.helper store
huggingface-cli login
Optionally, configure Accelerate:
accelerate config
Optionally, log in to Weights and Biases:
wandb login
Step 1: Create Text Descriptions
1.A. Annotate the Expresso dataset
Use the main.py file from DataSpeech to label continuous variables:
python ./dataspeech/main.py "ylacombe/expresso" \
--configuration "default" \
--text_column_name "text" \
--audio_column_name "audio" \
--cpu_num_workers 8 \
--rename_column \
--repo_id "expresso-tags"
The resulting dataset will be pushed to the Hugging Face Hub.
1.B. Map annotations to text bins
Map continuous variables to discrete ones by binning and assigning text labels. Pass v01_bin_edges.json as an input argument:
python ./dataspeech/scripts/metadata_to_text.py \
"reach-vb/expresso-tags" \
--repo_id "expresso-tags" \
--configuration "default" \
--cpu_num_workers "8" \
--path_to_bin_edges "./examples/tags_to_annotations/v01_bin_edges.json" \
--avoid_pitch_computation
The resulting dataset will be pushed to the Hugging Face Hub.
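To make the binning step concrete, here is a minimal sketch of mapping one continuous value to a text label with `numpy.digitize`. It is not the DataSpeech implementation: the edges, the labels, and the `value_to_label` helper are hypothetical, and the real edges come from v01_bin_edges.json.

```python
import numpy as np

# Hypothetical bin edges and labels for a "speaking rate" variable.
# These values are invented for illustration and are not taken from v01_bin_edges.json.
RATE_EDGES = [0.0, 10.0, 15.0, 20.0, 40.0]
RATE_LABELS = ["very slowly", "moderately slowly", "moderately fast", "very fast"]


def value_to_label(value: float, edges: list, labels: list) -> str:
    """Map a continuous value to the label of the bin it falls into."""
    if len(labels) != len(edges) - 1:
        raise ValueError("need exactly one label per bin")
    # np.digitize returns i such that edges[i - 1] <= value < edges[i]
    idx = int(np.digitize(value, edges)) - 1
    # Values outside the edge range are clamped to the first or last bin
    idx = min(max(idx, 0), len(labels) - 1)
    return labels[idx]


print(value_to_label(12.0, RATE_EDGES, RATE_LABELS))
```

Clamping out-of-range values is a design choice made for this sketch; a real pipeline may treat them differently.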
1.C. Create natural language descriptions from those text bins
Use the template prompt creation script in Parler-TTS. Download the modified script:
import shutil
from huggingface_hub import hf_hub_download
script_path = hf_hub_download(repo_id="parler-tts/parler-tts-mini-expresso", filename="run_prompt_creation.py")
shutil.copy(script_path, "./dataspeech/run_prompt_creation_expresso.py")  # path used by the launch command
Launch prompt creation using the Mistral Instruct 7B model:
accelerate launch ./dataspeech/run_prompt_creation_expresso.py \
--dataset_name "reach-vb/expresso-tags" \
--dataset_config_name "default" \
--model_name_or_path "mistralai/Mistral-7B-Instruct-v0.2" \
--per_device_eval_batch_size 32 \
--attn_implementation "sdpa" \
--dataloader_num_workers 8 \
--output_dir "./tmp_expresso" \
--load_in_4bit \
--push_to_hub \
--hub_dataset_id "expresso-tagged-w-speech-mistral" \
--preprocessing_num_workers 16
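The idea behind this step is that the text bins for each sample are placed into an instruction prompt, and the LLM rewrites them as one natural sentence. The sketch below only illustrates that assembly: the template wording, the tag names, and `build_llm_prompt` are hypothetical and do not reproduce the prompt used in run_prompt_creation.py.

```python
# Hypothetical instruction template; the real script defines its own prompt.
PROMPT_TEMPLATE = (
    "You will be given keywords describing a speech sample. "
    "Write one short sentence describing the sample using only those keywords.\n"
    "Keywords: {keywords}"
)


def build_llm_prompt(tags: dict) -> str:
    """Join the non-empty tags into a keyword list and fill the template."""
    keywords = ", ".join(f"{name}: {value}" for name, value in tags.items() if value)
    return PROMPT_TEMPLATE.format(keywords=keywords)


# Tag names and values below are assumptions chosen for illustration.
example_tags = {
    "speaker": "Thomas",
    "emotion": "sad",
    "speaking rate": "moderately slowly",
    "noise": "",  # empty tags are skipped
}
print(build_llm_prompt(example_tags))
```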
Step 2: Fine-Tune the Model
Fine-tune the model using the Parler-TTS training script run_parler_tts_training.py. Fine-tune on a combination of three datasets:
- Expresso
- Jenny
- LibriTTS-R
accelerate launch ./training/run_parler_tts_training.py \
--model_name_or_path "parler-tts/parler_tts_mini_v0.1" \
--feature_extractor_name "parler-tts/dac_44khZ_8kbps" \
--description_tokenizer_name "parler-tts/parler_tts_mini_v0.1" \
--prompt_tokenizer_name "parler-tts/parler_tts_mini_v0.1" \
📄 License
This project is licensed under the Apache-2.0 license.
| Property | Details |
|---|---|
| Model Type | Text-to-Speech |
| Training Data | ylacombe/expresso, reach-vb/jenny_tts_dataset, blabble-io/libritts_r |
💡 Usage Tip
- Specify the name of a male speaker (Jerry, Thomas) or female speaker (Talia, Elisabeth) for consistent voices.
- The model can generate in a range of emotions, including: "happy", "confused", "default" (meaning no particular emotion conveyed), "laughing", "sad", "whisper", "emphasis".
- Include the term "high quality audio" to generate the highest quality audio, and "very noisy audio" for high levels of background noise.
- Punctuation can be used to control the prosody of the generations, e.g. use commas to add small breaks in speech.
- To emphasise particular words, wrap them in asterisks (e.g. *you*) and include "emphasis" in the prompt.
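The tips above can be combined into a small helper that assembles a description string. This is a sketch under stated assumptions: `build_description` and its parameters are hypothetical, and the sentence pattern is modelled on the single example description in the Quick Start, so other phrasings may work equally well or better.

```python
# Speaker names and emotions listed in the tips above.
SPEAKERS = {"Jerry", "Thomas", "Talia", "Elisabeth"}
TONES = {"happy", "confused", "default", "laughing", "sad", "whisper"}


def build_description(speaker: str, tone: str = "default",
                      pace: str = "moderately slowly",
                      emphasis: bool = False,
                      high_quality: bool = True) -> str:
    """Assemble a description from a speaker, tone, pace, and audio quality."""
    if speaker not in SPEAKERS:
        raise ValueError(f"unknown speaker: {speaker}")
    if tone not in TONES:
        raise ValueError(f"unknown tone: {tone}")
    sentence = f"{speaker} speaks {pace}"
    # "default" means no particular emotion, so no tone phrase is added
    if tone != "default":
        sentence += f" in a {tone} tone"
    extras = ["emphasis"] if emphasis else []
    extras.append("high quality audio" if high_quality else "very noisy audio")
    return f"{sentence} with {' and '.join(extras)}."


print(build_description("Thomas", tone="sad", emphasis=True))
```

When `emphasis=True`, remember to also wrap the emphasised words of the prompt in asterisks.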