🚀 Parler-TTS Mini v1 ft. ParaSpeechCaps
We finetuned a base TTS model on the ParaSpeechCaps dataset to create a TTS model that can generate speech with rich style control using textual prompts.
🚀 Quick Start
We finetuned parler-tts/parler-tts-mini-v1 on our ParaSpeechCaps dataset. The resulting model generates speech with rich style control (pitch, rhythm, clarity, emotion, etc.) via a textual style prompt (e.g., 'A male speaker's speech is distinguished by a slurred articulation, delivered at a measured pace in a clear environment.').
ParaSpeechCaps (PSC) is our large-scale dataset offering rich style annotations for speech utterances. It supports 59 style tags, covering both speaker-level intrinsic style tags and utterance-level situational style tags. It consists of a human-annotated subset, ParaSpeechCaps-Base, and a large automatically-annotated subset, ParaSpeechCaps-Scaled. Our novel pipeline, which combines off-the-shelf text and speech embedders, classifiers, and an audio language model, allows us to automatically scale rich tag annotations for such a wide variety of style tags for the first time.
For more information, please refer to our paper, codebase, and demo website.
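Style prompts are free-form natural-language descriptions. The snippet below lists a few for illustration; only the first is the example shown above, while the others are hypothetical strings we wrote in the same format, not verbatim PSC annotations:

```python
# Illustrative style prompts combining intrinsic speaker tags (e.g., slurred,
# deep) and situational tags (e.g., sad, clear environment). Only the first
# is from the example above; the rest are hypothetical.
example_descriptions = [
    "A male speaker's speech is distinguished by a slurred articulation, delivered at a measured pace in a clear environment.",
    "In a clear environment, a female voice speaks with a happy, high-pitched tone.",
    "A male speaker delivers his words at a slow pace with a deep, gravelly voice.",
]
```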
✨ Features
- Rich Style Control: Generate speech with various styles (pitch, rhythm, clarity, emotion, etc.) using textual prompts.
- Large-scale Dataset: Utilize the ParaSpeechCaps dataset with rich style annotations.
- Novel Pipeline: Automatically scale rich tag annotations for a wide range of style tags.
📦 Installation
This repository has been tested with Python 3.11 (`conda create -n paraspeechcaps python=3.11`), but most other versions should also work.
```sh
git clone https://github.com/ajd12342/paraspeechcaps.git
cd paraspeechcaps/model/parler-tts
pip install -e .[train]
```
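As an optional sanity check (assuming the commands above ran in the same environment), verify that the editable install succeeded:

```sh
# Optional sanity check: the package should import cleanly after installation.
python -c "import parler_tts; print('parler-tts installed OK')"
```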
💻 Usage Examples
Basic Usage
```python
import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf

device = "cuda:0" if torch.cuda.is_available() else "cpu"

model_name = "ajd12342/parler-tts-mini-v1-paraspeechcaps"
guidance_scale = 1.5  # classifier-free guidance strength for style adherence

model = ParlerTTSForConditionalGeneration.from_pretrained(model_name).to(device)
# Two tokenizers: one for the style description, one (left-padded) for the transcript.
description_tokenizer = AutoTokenizer.from_pretrained(model_name)
transcription_tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")

# The style prompt describing how the speech should sound, and the text to speak.
input_description = "In a clear environment, a male voice speaks with a sad tone.".replace('\n', ' ').rstrip()
input_transcription = "Was that your landlord?".replace('\n', ' ').rstrip()

input_description_tokenized = description_tokenizer(input_description, return_tensors="pt").to(model.device)
input_transcription_tokenized = transcription_tokenizer(input_transcription, return_tensors="pt").to(model.device)

# Generate the waveform and write it to disk.
generation = model.generate(
    input_ids=input_description_tokenized.input_ids,
    prompt_input_ids=input_transcription_tokenized.input_ids,
    guidance_scale=guidance_scale,
)
audio_arr = generation.cpu().numpy().squeeze()
sf.write("output.wav", audio_arr, model.config.sampling_rate)
```
For a full inference script that includes ASR-based selection via repeated sampling, as well as other scripts, refer to our codebase.
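As a rough illustration of that selection step, the sketch below generates several candidates (Parler-TTS samples stochastically by default, so repeated calls differ), transcribes each with an off-the-shelf ASR model, and keeps the sample whose transcript best matches the input text. The ASR checkpoint, the `jiwer` WER metric, and the candidate count are illustrative assumptions, not the exact recipe from our codebase; the snippet reuses the variables from the Basic Usage example above.

```python
# Hypothetical sketch of ASR-based selection via repeated sampling.
# Reuses model, tokenized inputs, and guidance_scale from the example above;
# the ASR model, WER metric, and candidate count are illustrative choices.
import jiwer
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small.en", device=device)

best_audio, best_wer = None, float("inf")
for _ in range(5):  # candidate count is an arbitrary choice
    generation = model.generate(
        input_ids=input_description_tokenized.input_ids,
        prompt_input_ids=input_transcription_tokenized.input_ids,
        guidance_scale=guidance_scale,
    )
    audio = generation.cpu().numpy().squeeze()
    transcript = asr({"raw": audio, "sampling_rate": model.config.sampling_rate})["text"]
    score = jiwer.wer(input_transcription.lower(), transcript.lower().strip())
    if score < best_wer:
        best_audio, best_wer = audio, score

sf.write("output_best.wav", best_audio, model.config.sampling_rate)
```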
📚 Documentation
Model Information
| Property | Details |
|----------|---------|
| Model Type | parler-tts/parler-tts-mini-v1 finetuned on ParaSpeechCaps |
| Training Data | amphion/Emilia-Dataset, ParaSpeechCaps |
| Library Name | transformers |
| Pipeline Tag | text-to-speech |
| License | CC BY-NC-SA 4.0 |
📄 License
This project is licensed under the CC BY-NC-SA 4.0 license.
📚 Citation
If you use this model, the dataset, or the repository, please cite our work as follows:
```bibtex
@misc{diwan2025scalingrichstylepromptedtexttospeech,
      title={Scaling Rich Style-Prompted Text-to-Speech Datasets},
      author={Anuj Diwan and Zhisheng Zheng and David Harwath and Eunsol Choi},
      year={2025},
      eprint={2503.04713},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2503.04713},
}
```