🚀 Parler-TTS Mini v1 ft. ParaSpeechCaps
We finetuned a base TTS model on the ParaSpeechCaps dataset to create a TTS model that can generate speech with rich style control using textual prompts.
🚀 Quick Start
We finetuned parler-tts/parler-tts-mini-v1 on our ParaSpeechCaps dataset. The resulting model generates speech with rich style control (pitch, rhythm, clarity, emotion, etc.) via a textual style prompt (e.g., 'A male speaker's speech is distinguished by a slurred articulation, delivered at a measured pace in a clear environment.').
ParaSpeechCaps (PSC) is our large-scale dataset offering rich style annotations for speech utterances. It supports 59 style tags, covering both speaker-level intrinsic style tags and utterance-level situational style tags. It consists of a human-annotated subset, ParaSpeechCaps-Base, and a large automatically-annotated subset, ParaSpeechCaps-Scaled. Our novel pipeline, which combines off-the-shelf text and speech embedders, classifiers, and an audio language model, allows us to automatically scale rich tag annotations for such a wide variety of style tags for the first time.
For more information, please refer to our paper, codebase, and demo website.
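Style prompts are free-form natural-language descriptions. The snippet below lists a few for illustration; only the first is the example shown above, while the others are hypothetical strings we wrote in the same format, not verbatim PSC annotations:

```python
# Illustrative style prompts combining intrinsic speaker tags (e.g., slurred,
# deep) and situational tags (e.g., sad, clear environment). Only the first
# is from the example above; the rest are hypothetical.
example_descriptions = [
    "A male speaker's speech is distinguished by a slurred articulation, delivered at a measured pace in a clear environment.",
    "In a clear environment, a female voice speaks with a happy, high-pitched tone.",
    "A male speaker delivers his words at a slow pace with a deep, gravelly voice.",
]
```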
✨ Features
- Rich Style Control: Generate speech with various styles (pitch, rhythm, clarity, emotion, etc.) using textual prompts.
- Large-scale Dataset: Utilize the ParaSpeechCaps dataset with rich style annotations.
- Novel Pipeline: Automatically scale rich tag annotations for a wide range of style tags.
📦 Installation
This repository has been tested with Python 3.11 (`conda create -n paraspeechcaps python=3.11`), but most other versions should also work.
```sh
git clone https://github.com/ajd12342/paraspeechcaps.git
cd paraspeechcaps/model/parler-tts
pip install -e .[train]
```
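As an optional sanity check (assuming the commands above ran in the same environment), verify that the editable install succeeded:

```sh
# Optional sanity check: the package should import cleanly after installation.
python -c "import parler_tts; print('parler-tts installed OK')"
```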
💻 Usage Examples
Basic Usage
```python
import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf

device = "cuda:0" if torch.cuda.is_available() else "cpu"

model_name = "ajd12342/parler-tts-mini-v1-paraspeechcaps"
guidance_scale = 1.5  # classifier-free guidance strength for style adherence

model = ParlerTTSForConditionalGeneration.from_pretrained(model_name).to(device)
# Two tokenizers: one for the style description, one (left-padded) for the transcript.
description_tokenizer = AutoTokenizer.from_pretrained(model_name)
transcription_tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")

# The style prompt describing how the speech should sound, and the text to speak.
input_description = "In a clear environment, a male voice speaks with a sad tone.".replace('\n', ' ').rstrip()
input_transcription = "Was that your landlord?".replace('\n', ' ').rstrip()

input_description_tokenized = description_tokenizer(input_description, return_tensors="pt").to(model.device)
input_transcription_tokenized = transcription_tokenizer(input_transcription, return_tensors="pt").to(model.device)

# Generate the waveform and write it to disk.
generation = model.generate(
    input_ids=input_description_tokenized.input_ids,
    prompt_input_ids=input_transcription_tokenized.input_ids,
    guidance_scale=guidance_scale,
)
audio_arr = generation.cpu().numpy().squeeze()
sf.write("output.wav", audio_arr, model.config.sampling_rate)
```
For a full inference script that includes ASR-based selection via repeated sampling, as well as other scripts, refer to our codebase.
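As a rough illustration of that selection step, the sketch below generates several candidates (Parler-TTS samples stochastically by default, so repeated calls differ), transcribes each with an off-the-shelf ASR model, and keeps the sample whose transcript best matches the input text. The ASR checkpoint, the `jiwer` WER metric, and the candidate count are illustrative assumptions, not the exact recipe from our codebase; the snippet reuses the variables from the Basic Usage example above.

```python
# Hypothetical sketch of ASR-based selection via repeated sampling.
# Reuses model, tokenized inputs, and guidance_scale from the example above;
# the ASR model, WER metric, and candidate count are illustrative choices.
import jiwer
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small.en", device=device)

best_audio, best_wer = None, float("inf")
for _ in range(5):  # candidate count is an arbitrary choice
    generation = model.generate(
        input_ids=input_description_tokenized.input_ids,
        prompt_input_ids=input_transcription_tokenized.input_ids,
        guidance_scale=guidance_scale,
    )
    audio = generation.cpu().numpy().squeeze()
    transcript = asr({"raw": audio, "sampling_rate": model.config.sampling_rate})["text"]
    score = jiwer.wer(input_transcription.lower(), transcript.lower().strip())
    if score < best_wer:
        best_audio, best_wer = audio, score

sf.write("output_best.wav", best_audio, model.config.sampling_rate)
```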
📚 Documentation
Model Information
| Property | Details |
|----------|---------|
| Model Type | parler-tts/parler-tts-mini-v1 finetuned on ParaSpeechCaps |
| Training Data | amphion/Emilia-Dataset, ParaSpeechCaps |
| Library Name | transformers |
| Pipeline Tag | text-to-speech |
| License | CC BY-NC-SA 4.0 |
📄 License
This project is licensed under the CC BY-NC-SA 4.0 license.
📚 Citation
If you use this model, the dataset, or the repository, please cite our work as follows:
```bibtex
@misc{diwan2025scalingrichstylepromptedtexttospeech,
      title={Scaling Rich Style-Prompted Text-to-Speech Datasets},
      author={Anuj Diwan and Zhisheng Zheng and David Harwath and Eunsol Choi},
      year={2025},
      eprint={2503.04713},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2503.04713},
}
```