parler-tts-mini-v1-paraspeechcaps-only-base Open-source TTS Model - Achieve Diverse Voice Styles with Text Prompts

Parler Tts Mini V1 Paraspeechcaps Only Base

Developed by ajd12342

A text-to-speech model capable of controlling rich speech styles through textual style prompts

Speech Synthesis

Transformers

English

#Style-Controllable Speech Synthesis #Multi-Dimensional Voice Control #Human-Annotated Dataset

Downloads 17

Release Time : 2/28/2025

Model Overview

This text-to-speech model is parler-tts/parler-tts-mini-v1 fine-tuned on the human-annotated ParaSpeechCaps-Base dataset. Textual style prompts control speech features such as pitch, rhythm, clarity, and emotion.

Model Features

Rich Style Control

Precisely control speech features such as pitch, rhythm, clarity, and emotion through text prompts

High-Quality Speech Generation

Fine-tuned on a human-annotated dataset, generating high-quality speech

Diverse Style Labels

Supports 59 style labels, covering speaker-inherent styles and contextual sentence styles

Model Capabilities

Text-to-Speech

Speech Style Control

Emotional Speech Synthesis

Use Cases

Speech Synthesis Applications

Audiobook Generation

Generate expressive audiobooks based on text content and emotional prompts

Voice Assistants

Provide more natural and emotionally rich voice output for voice assistants

Assistive Technologies

Visual Impairment Assistance

Provide more natural and comprehensible voice output for visually impaired users

🚀 Parler-TTS Mini v1 ft. ParaSpeechCaps-Base

This project finetunes a TTS model on a large-scale speech dataset with rich style annotations, enabling control of speech styles using textual prompts.

🚀 Quick Start

We finetuned parler-tts/parler-tts-mini-v1 on the human-annotated Base subset of our ParaSpeechCaps dataset. This creates a TTS model that can generate speech while controlling for rich styles (pitch, rhythm, clarity, emotion, etc.) with a textual style prompt (e.g., 'A male speaker's speech is distinguished by a slurred articulation, delivered at a measured pace in a clear environment.').

For our improved model finetuned on the entirety of ParaSpeechCaps, please check out ajd12342/parler-tts-mini-v1-paraspeechcaps.

ParaSpeechCaps (PSC) is our large-scale dataset that provides rich style annotations for speech utterances. It supports 59 style tags covering speaker-level intrinsic style tags and utterance-level situational style tags. It consists of a human-annotated subset ParaSpeechCaps-Base and a large automatically-annotated subset ParaSpeechCaps-Scaled. Our novel pipeline combining off-the-shelf text and speech embedders, classifiers and an audio language model allows us to automatically scale rich tag annotations for such a wide variety of style tags for the first time.
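
To make the distinction between speaker-level and utterance-level tags concrete, here is a minimal sketch of a helper that assembles a style prompt from the two kinds of phrases. The `StylePrompt` class and its sentence template are hypothetical and not part of the ParaSpeechCaps codebase. The model accepts free-form descriptions, so a template like this is only one convenient way to write them. The example phrases are taken from the prompts shown in this card, and their assignment to the intrinsic or situational group is illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class StylePrompt:
    """Hypothetical helper: groups style phrases and renders one description."""

    speaker: str = "male"
    # Speaker-level (intrinsic) phrases, e.g. "a slurred articulation".
    intrinsic: list[str] = field(default_factory=list)
    # Utterance-level (situational) phrases, e.g. "a sad tone".
    situational: list[str] = field(default_factory=list)
    environment: str = "clear"

    def render(self) -> str:
        traits = self.intrinsic + self.situational
        if traits:
            body = f"a {self.speaker} voice speaks with " + " and ".join(traits) + "."
        else:
            body = f"a {self.speaker} voice speaks."
        text = f"In a {self.environment} environment, {body}"
        # Same clean-up the usage example applies to its description string.
        return text.replace("\n", " ").rstrip()


if __name__ == "__main__":
    prompt = StylePrompt(
        intrinsic=["a slurred articulation"],
        situational=["a sad tone"],
    )
    print(prompt.render())
```

The rendered string can be passed as the description input in the usage example further down.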

Please take a look at our paper, our codebase and our demo website for more information.

✨ Features

Rich Style Control: Generate speech with various styles using textual prompts.
Large-scale Dataset: Utilize ParaSpeechCaps, a large-scale dataset with rich style annotations.
Novel Pipeline: Automatically scale rich tag annotations for a wide variety of style tags.

📦 Installation

This repository has been tested with Python 3.11; most other versions should also work. To create a matching conda environment, run conda create -n paraspeechcaps python=3.11 before installing.

git clone https://github.com/ajd12342/paraspeechcaps.git
cd paraspeechcaps/model/parler-tts
pip install -e .[train]
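
After installing, a quick check that the modules imported by the usage example in this card can be found may save a confusing traceback later. The sketch below uses only the standard library. The `check_environment` helper and the module list are assumptions for illustration, not something shipped with the repository.

```python
import importlib.util
import sys

# Top-level modules imported by the usage example in this card.
REQUIRED_MODULES = ("torch", "transformers", "parler_tts", "soundfile")


def check_environment(modules=REQUIRED_MODULES):
    """Map each top-level module name to True if it can be located."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}")
    for name, found in check_environment().items():
        print(f"{name}: {'ok' if found else 'MISSING'}")
```

This only checks that each module can be located; it does not import them or verify versions.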

💻 Usage Examples

Basic Usage

import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf

# Use the first GPU when available; otherwise fall back to CPU.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_name = "ajd12342/parler-tts-mini-v1-paraspeechcaps-only-base"
guidance_scale = 1.5

model = ParlerTTSForConditionalGeneration.from_pretrained(model_name).to(device)
# One tokenizer for the style description, one (left-padded) for the text to be spoken.
description_tokenizer = AutoTokenizer.from_pretrained(model_name)
transcription_tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")

# Style prompt (how to speak) and transcription (what to say), each flattened to one line.
input_description = "In a clear environment, a male voice speaks with a sad tone.".replace('\n', ' ').rstrip()
input_transcription = "Was that your landlord?".replace('\n', ' ').rstrip()

input_description_tokenized = description_tokenizer(input_description, return_tensors="pt").to(model.device)
input_transcription_tokenized = transcription_tokenizer(input_transcription, return_tensors="pt").to(model.device)

# The style description goes in input_ids; the transcription goes in prompt_input_ids.
generation = model.generate(
    input_ids=input_description_tokenized.input_ids,
    prompt_input_ids=input_transcription_tokenized.input_ids,
    guidance_scale=guidance_scale,
)

# Move the waveform to the CPU and write it to disk at the model's sampling rate.
audio_arr = generation.cpu().numpy().squeeze()
sf.write("output.wav", audio_arr, model.config.sampling_rate)

For a full inference script that includes ASR-based selection via repeated sampling and other scripts, refer to our codebase.
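
The idea behind ASR-based selection via repeated sampling can be sketched as follows: generate several candidate waveforms for the same inputs, transcribe each one with a speech recognizer, and keep the candidate whose transcript is closest to the intended text. This is a minimal illustration and not the script from the codebase. The `transcribe` callable is a hypothetical stand-in for any ASR model, and word error rate with simple text normalization is an assumed scoring choice.

```python
from typing import Callable, Sequence, Tuple, TypeVar

Candidate = TypeVar("Candidate")


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, and split into words."""
    cleaned = "".join(ch.lower() if ch.isalnum() or ch.isspace() else " " for ch in text)
    return cleaned.split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        return 0.0 if not hyp else 1.0
    # Levenshtein distance over words, keeping one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)


def select_best(
    candidates: Sequence[Candidate],
    transcribe: Callable[[Candidate], str],  # hypothetical ASR wrapper
    reference: str,
) -> Tuple[Candidate, float]:
    """Return the candidate with the lowest WER against the reference text."""
    if not candidates:
        raise ValueError("need at least one candidate")
    scored = [
        (word_error_rate(reference, transcribe(candidate)), index)
        for index, candidate in enumerate(candidates)
    ]
    best_wer, best_index = min(scored)  # ties resolve to the earliest candidate
    return candidates[best_index], best_wer
```

In practice, `candidates` would hold the audio arrays from several `model.generate` calls with the same inputs, and `transcribe` would wrap whichever ASR model is available.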

🔧 Technical Details

Our novel pipeline combines off-the-shelf text and speech embedders, classifiers and an audio language model. This allows us to automatically scale rich tag annotations for a wide variety of style tags for the first time. The pipeline is designed to handle the large-scale ParaSpeechCaps dataset, which consists of a human-annotated subset ParaSpeechCaps-Base and a large automatically-annotated subset ParaSpeechCaps-Scaled.

📄 License

License: CC BY-NC-SA 4.0

📚 Citation

If you use this model, the dataset or the repository, please cite our work as follows:

@misc{diwan2025scalingrichstylepromptedtexttospeech,
      title={Scaling Rich Style-Prompted Text-to-Speech Datasets}, 
      author={Anuj Diwan and Zhisheng Zheng and David Harwath and Eunsol Choi},
      year={2025},
      eprint={2503.04713},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2503.04713}, 
}

Property	Details
Base Model	parler-tts/parler-tts-mini-v1
Language	en
Library Name	transformers
License	cc-by-nc-sa-4.0
Pipeline Tag	text-to-speech
