Parler-TTS Mini: Expresso, an Open-Source TTS Model with Emotion and Speaker Control

Developed by parler-tts

Parler-TTS Mini: Expresso is a lightweight text-to-speech model, based on Parler-TTS Mini v0.1 and fine-tuned on the Expresso dataset, that supports emotion and speaker control.

Speech Synthesis

Transformers

English | Open Source License: Apache-2.0 | #Emotional Speech Synthesis #Multi-speaker Support #High-quality Audio Generation

Release Time: 5/15/2024

Model Overview

This is a text-to-speech model that generates natural, fluent speech. It is specifically optimized for emotion control (happy, confused, laughing, sad, etc.) and for consistent speaker voices (Jerry, Thomas, Elisabeth, Talia).

Model Features

Emotion Control

Supports generating speech with various emotions, including happiness, confusion, laughter, sadness, etc.

Speaker Consistency

Can generate speech for four different speakers (two male and two female) while maintaining voice consistency.

High-quality Audio

Generates high-quality speech; including "high quality audio" in the description requests the cleanest output.

Prosody Control

Controls prosody through punctuation (for example, commas add short pauses) and emphasis through asterisks placed around a word.

Model Capabilities

Text-to-Speech

Emotional Speech Generation

Multi-speaker Speech Generation

Prosody Control

Use Cases

Voice Assistants

Emotional Voice Assistant

Adds emotional expression to voice assistants to improve the user experience.

Generates more natural and expressive voice feedback.

Audio Content Creation

Audiobook Narration

Generates speech with different characters and emotions for audiobooks.

Creates a more vivid audio content experience.

Assistive Technology

Visual Impairment Assistance

Generates expressive speech content for visually impaired users.

Makes spoken information easier to follow and improves the user experience.

🚀 Parler-TTS Mini: Expresso

Parler-TTS Mini: Expresso is a lightweight text-to-speech (TTS) model. It is a fine-tuned version of Parler-TTS Mini v0.1 on the Expresso dataset. It generates high-quality, natural-sounding speech and, compared to the original model, offers better control over emotions and more consistent speaker voices.

It is part of the first release from the Parler-TTS project, which aims to provide the community with TTS training resources and dataset pre-processing code.

🚀 Quick Start

Using Expresso is straightforward. First, install the library from source:

pip install git+https://github.com/huggingface/parler-tts.git

Then, you can use the model with the following inference snippet:

import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer, set_seed
import soundfile as sf

# Use the first GPU when available, otherwise fall back to CPU
device = "cuda:0" if torch.cuda.is_available() else "cpu"

model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-mini-expresso").to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-expresso")

# `prompt` is the text to be spoken; `description` sets the speaker, emotion, and audio quality
prompt = "Why do you make me do these examples? They're *so* generic."
description = "Thomas speaks moderately slowly in a sad tone with emphasis and high quality audio."

# Tokenize both inputs with the same tokenizer
input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

# Fix the random seed so repeated runs produce the same audio
set_seed(42)
generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
# Move the waveform to the CPU and drop singleton dimensions before writing it to disk
audio_arr = generation.cpu().numpy().squeeze()
sf.write("parler_tts_out.wav", audio_arr, model.config.sampling_rate)
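The snippet above relies on `soundfile` to write the waveform. Where that dependency is unavailable, the same kind of array can in principle be written with Python's standard-library `wave` module. The following is only a sketch under stated assumptions: `write_wav_pcm16` is a hypothetical helper, the 44,100 Hz rate is a placeholder for `model.config.sampling_rate`, and a sine tone stands in for the generated audio.

```python
import math
import os
import struct
import tempfile
import wave


def write_wav_pcm16(path, samples, sample_rate):
    """Write mono float samples in [-1.0, 1.0] as 16-bit PCM; return the duration in seconds."""
    frames = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))  # guard against out-of-range values
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))
    return len(samples) / sample_rate


# Placeholder for `audio_arr`: one second of a 440 Hz tone, not real model output.
SAMPLE_RATE = 44100  # assumed value; use model.config.sampling_rate in practice
tone = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
out_path = os.path.join(tempfile.gettempdir(), "parler_tts_out_pcm16.wav")
duration = write_wav_pcm16(out_path, tone, SAMPLE_RATE)
print(f"Wrote {duration:.2f} s of audio to {out_path}")
```

With the real model output, `audio_arr.tolist()` could be passed in place of `tone`.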

✨ Features

Superior Control: Compared to the original model, offers better control over emotions (happy, confused, laughing, sad) and more consistent speaker voices (Jerry, Thomas, Elisabeth, Talia).
High-Quality Speech: Generates high-quality, natural-sounding speech.

📦 Installation

Install the library from source using the following command:

pip install git+https://github.com/huggingface/parler-tts.git

💻 Usage Examples

Advanced Usage

Tips for advanced usage:

Specify the name of a male speaker (Jerry, Thomas) or a female speaker (Talia, Elisabeth) to get a consistent voice.
The model can generate a range of emotions, including "happy", "confused", "default" (no particular emotion conveyed), "laughing", "sad", "whisper", and "emphasis".
Include the term "high quality audio" in the description to generate the highest-quality audio, or "very noisy audio" for high levels of background noise.
Punctuation controls the prosody of the generated speech; for example, commas add short pauses.
To emphasise particular words, wrap them in asterisks (e.g. *so* in the example above) and include "emphasis" in the description.
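The tips above can be folded into two small helpers. This is a minimal sketch, not part of the Parler-TTS API: `build_description` and `emphasise` are hypothetical names, the sentence template simply mirrors the example description used earlier, and "emphasis" is treated as a flag rather than a tone. The wording the model responds to best may differ.

```python
import re

SPEAKERS = ("Jerry", "Thomas", "Talia", "Elisabeth")
TONES = ("happy", "confused", "default", "laughing", "sad", "whisper")


def build_description(speaker, tone, emphasis=False, high_quality=True):
    """Compose a description string from a known speaker, tone, and quality setting."""
    if speaker not in SPEAKERS:
        raise ValueError(f"Unknown speaker: {speaker!r}")
    if tone not in TONES:
        raise ValueError(f"Unknown tone: {tone!r}")
    extras = ["emphasis"] if emphasis else []
    extras.append("high quality audio" if high_quality else "very noisy audio")
    return f"{speaker} speaks in a {tone} tone with {' and '.join(extras)}."


def emphasise(text, word):
    """Wrap the first whole-word occurrence of `word` in asterisks."""
    pattern = rf"\b{re.escape(word)}\b"
    return re.sub(pattern, lambda match: f"*{match.group(0)}*", text, count=1)


description = build_description("Thomas", "sad", emphasis=True)
prompt = emphasise("Why do you make me do these examples? They're so generic.", "so")
print(description)
print(prompt)
```

Validating the speaker and tone up front catches typos before a generation run is spent on them.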

📚 Documentation

Training Procedure

Expresso is a high-quality, expressive speech dataset with samples from four speakers (two male, two female). By fine-tuning Parler-TTS Mini v0.1 on this dataset, we can train the model to follow emotion and speaker prompts.

To reproduce this fine-tuning run, we need to perform two steps:

Create text descriptions from the audio samples in the Expresso dataset
Train the model on the (text, audio) pairs

Step 0: Set-Up

Create a fresh Python environment:

python3 -m venv parler-env
source parler-env/bin/activate

Install PyTorch according to the official instructions. Then install DataSpeech and Parler-TTS sequentially:

git clone https://github.com/huggingface/dataspeech.git && cd dataspeech && pip install -r requirements.txt
cd ..
git clone https://github.com/huggingface/parler-tts.git && cd parler-tts && pip install -e ."[train]"
cd ..

Link your Hugging Face account:

git config --global credential.helper store
huggingface-cli login

Optionally, configure Accelerate:

accelerate config

Optionally, log in to Weights and Biases:

wandb login
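Before moving on, it can help to confirm that the environment actually contains the packages installed above. The check below is a small sketch: `missing_modules` is a hypothetical helper, and the module names in `required` are assumptions about what the installation steps provide.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed import names for the packages installed during set-up.
required = ["torch", "parler_tts", "accelerate"]
missing = missing_modules(required)
print("Missing modules:", ", ".join(missing) if missing else "none")
```

`find_spec` only looks the module up without importing it, so the check stays fast even for large packages.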

Step 1: Create Text Descriptions

1.A. Annotate the Expresso dataset

Use the main.py file from DataSpeech to label continuous variables:

python ./dataspeech/main.py "ylacombe/expresso" \
  --configuration "default" \
  --text_column_name "text" \
  --audio_column_name "audio" \
  --cpu_num_workers 8 \
  --rename_column \
  --repo_id "expresso-tags"

The resulting dataset will be pushed to the Hugging Face Hub under your account; the following commands use reach-vb/expresso-tags as the example location.

1.B. Map annotations to text bins

Map the continuous variables to discrete bins and assign each bin a text label. Pass v01_bin_edges.json through the --path_to_bin_edges argument:

python ./dataspeech/scripts/metadata_to_text.py \
    "reach-vb/expresso-tags" \
    --repo_id "expresso-tags" \
    --configuration "default" \
    --cpu_num_workers "8" \
    --path_to_bin_edges "./examples/tags_to_annotations/v01_bin_edges.json" \
    --avoid_pitch_computation

The resulting dataset will be pushed to the Hugging Face Hub.
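The binning idea in step 1.B can be illustrated in a few lines. This is a sketch of the general technique only, not DataSpeech's implementation: `to_text_bin` is a hypothetical helper, and the edges, labels, and interior-edge layout are invented for illustration rather than taken from v01_bin_edges.json.

```python
from bisect import bisect_right


def to_text_bin(value, edges, labels):
    """Map a continuous value to a text label using sorted interior bin edges.

    Values below the first edge get labels[0]; values at or above the last
    edge get labels[-1].
    """
    if len(edges) != len(labels) - 1:
        raise ValueError("expected exactly one fewer edge than labels")
    return labels[bisect_right(edges, value)]


# Hypothetical edges and labels for a speaking-rate-like variable.
RATE_EDGES = [2.0, 3.5, 5.0]
RATE_LABELS = ["slowly", "moderately slowly", "moderately fast", "fast"]

for rate in (1.0, 3.0, 4.2, 6.5):
    print(rate, "->", to_text_bin(rate, RATE_EDGES, RATE_LABELS))
```

Using `bisect_right` keeps the lookup logarithmic in the number of bins and makes the boundary behaviour explicit: a value equal to an edge falls into the higher bin.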

1.C. Create natural language descriptions from those text bins

Use the template prompt creation script in Parler-TTS. Download the modified script:

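import shutil
from huggingface_hub import hf_hub_download

# Fetch the script, then copy it to the path used by the launch command
script_path = hf_hub_download(repo_id="parler-tts/parler-tts-mini-expresso", filename="run_prompt_creation.py")
shutil.copy(script_path, "./dataspeech/run_prompt_creation_expresso.py")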

Launch prompt creation using the Mistral Instruct 7B model:

accelerate launch ./dataspeech/run_prompt_creation_expresso.py \
  --dataset_name "reach-vb/expresso-tags" \
  --dataset_config_name "default" \
  --model_name_or_path "mistralai/Mistral-7B-Instruct-v0.2" \
  --per_device_eval_batch_size 32 \
  --attn_implementation "sdpa" \
  --dataloader_num_workers 8 \
  --output_dir "./tmp_expresso" \
  --load_in_4bit \
  --push_to_hub \
  --hub_dataset_id "expresso-tagged-w-speech-mistral" \
  --preprocessing_num_workers 16
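Conceptually, this step turns each sample's text bins into an instruction that the language model rewrites as a natural-sounding description. The sketch below shows one way such an instruction could be assembled. The template wording, the `build_llm_prompt` helper, and the field names are all hypothetical and are not taken from run_prompt_creation.py.

```python
def build_llm_prompt(tags):
    """Format one sample's text bins into a single instruction for an instruction-tuned LLM."""
    # Skip fields with empty values so the model is not asked to describe missing attributes.
    keywords = ", ".join(f"{key}: {value}" for key, value in tags.items() if value)
    return (
        "Write one sentence describing how the speaker sounds, "
        f"using only these keywords. {keywords}"
    )


# Hypothetical tags for a single sample.
sample_tags = {
    "speaker": "Thomas",
    "emotion": "sad",
    "speaking_rate": "moderately slowly",
    "noise": "",
}
llm_prompt = build_llm_prompt(sample_tags)
print(llm_prompt)
```

Keeping the template in one function makes it easy to adjust the wording and regenerate descriptions for the whole dataset.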

Step 2: Fine-Tune the Model

Fine-tune the model using the Parler-TTS training script run_parler_tts_training.py. Fine-tune on a combination of three datasets:

accelerate launch ./training/run_parler_tts_training.py \
    --model_name_or_path "parler-tts/parler_tts_mini_v0.1" \
    --feature_extractor_name "parler-tts/dac_44khZ_8kbps" \
    --description_tokenizer_name "parler-tts/parler_tts_mini_v0.1" \
    --prompt_tokenizer_name "parler-tts/parler_tts_mini_v0.1" \

📄 License

This project is licensed under the Apache-2.0 license.

Property	Details
Model Type	Text-to-Speech
Training Data	ylacombe/expresso, reach-vb/jenny_tts_dataset, blabble-io/libritts_r
