🚀 Indic Parler-TTS Pretrained
Indic Parler-TTS Pretrained is a multilingual Indic extension of Parler-TTS Mini. It was trained on an 8,385-hour multilingual Indic and English dataset and is released alongside its fine-tuned version: Indic Parler-TTS. This model can officially speak in 21 languages, including 20 Indic languages and English, and thanks to its better prompt tokenizer, it can easily be extended to other languages.
🚀 Quick Start
👨‍💻 Installation
Using Parler-TTS is as simple as "bonjour". Simply install the library once:
```sh
pip install git+https://github.com/huggingface/parler-tts.git
```
✨ Features
🛠️ Key capabilities
The model accepts two primary inputs:
- Transcript - The text to be converted to speech.
- Caption - A detailed description of how the speech should sound, e.g., "Leela speaks in a high-pitched, fast-paced, and cheerful tone, full of energy and happiness. The recording is very high quality with no background noise."
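Both inputs are plain strings, but as the usage examples below show, each one is handled by its own tokenizer: the caption goes through the description tokenizer of the text encoder, while the transcript goes through the model's prompt tokenizer. A minimal sketch of the two inputs:

```python
# The two inputs are plain strings; each is fed to its own tokenizer
# (the full generation pipeline is shown in the Basic Usage example below).
prompt = "Hey, how are you doing today?"  # transcript: the text to be spoken
description = (  # caption: how the speech should sound
    "Leela speaks in a high-pitched, fast-paced, and cheerful tone, full of energy "
    "and happiness. The recording is very high quality with no background noise."
)
```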
Key Features
- Language Support
- Officially supported languages: Assamese, Bengali, Bodo, Dogri, Kannada, Malayalam, Marathi, Sanskrit, Nepali, English, Telugu, Hindi, Gujarati, Konkani, Maithili, Manipuri, Odia, Santali, Sindhi, Tamil, and Urdu.
- Unofficial support: Chhattisgarhi, Kashmiri, Punjabi.
- Speaker Diversity
- 69 unique voices across the supported languages.
- Supported languages have a set of recommended voices optimized for naturalness and intelligibility.
- Emotion Rendering
- 10 languages officially support emotion-specific prompts: Assamese, Bengali, Bodo, Dogri, Kannada, Malayalam, Marathi, Sanskrit, Nepali, and Tamil.
- Emotion support for other languages exists but has not been extensively tested.
- Available emotions include: Command, Anger, Narration, Conversation, Disgust, Fear, Happy, Neutral, Proper Noun, News, Sad, and Surprise.
- Accent Flexibility
- The model officially supports Indian English accents through its English voices, providing clear and natural speech.
- For other accents, the model allows customization by specifying accent details, such as "A male British speaker" or "A female American speaker," using style transfer for more dynamic and personalized outputs.
- Customizable Output
Indic Parler-TTS Pretrained offers precise control over various speech characteristics using the caption input (example captions follow this list):
- Background Noise: Adjust the noise level in the audio, from clear to slightly noisy environments.
- Reverberation: Control the perceived distance of the voice, from close-sounding to distant-sounding speech.
- Expressivity: Specify how dynamic or monotone the speech should be, ranging from expressive to slightly expressive or monotone.
- Pitch: Modify the pitch of the speech, including high, low, or balanced tones.
- Speaking Rate: Change the speaking rate, from slow to fast.
- Voice Quality: Control the overall clarity and naturalness of the speech, adjusting from basic to refined voice quality.
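To make these features concrete, here are a few illustrative caption strings. The first is the example from above; the others combine a speaker name, an emotion, or an accent with recording attributes. The phrasing is free-form rather than a fixed syntax.

```python
# Illustrative captions only; the caption field accepts free-form descriptions.
example_captions = [
    # Named speaker + emotion (see the emotion list above)
    "Leela speaks in a high-pitched, fast-paced, and cheerful tone, full of energy "
    "and happiness. The recording is very high quality with no background noise.",
    # Accent control via a style description
    "A male speaker with a British accent delivers the text in a calm, measured voice.",
    # Background noise, reverberation, and speaking-rate control
    "A female speaker talks slowly in a monotone voice. The recording is slightly "
    "noisy and sounds distant, as if captured in a large room.",
]
```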
💻 Usage Examples
Basic Usage
```python
import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf

device = "cuda:0" if torch.cuda.is_available() else "cpu"

model = ParlerTTSForConditionalGeneration.from_pretrained("ai4bharat/indic-parler-tts-pretrained").to(device)
tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indic-parler-tts-pretrained")
description_tokenizer = AutoTokenizer.from_pretrained(model.config.text_encoder._name_or_path)

prompt = "Hey, how are you doing today?"
description = "A female speaker with a British accent delivers a slightly expressive and animated speech with a moderate speed and pitch. The recording is of very high quality, with the speaker's voice sounding clear and very close up."

# The caption and the transcript are tokenized by different tokenizers.
description_input_ids = description_tokenizer(description, return_tensors="pt").to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").to(device)

generation = model.generate(
    input_ids=description_input_ids.input_ids,
    attention_mask=description_input_ids.attention_mask,
    prompt_input_ids=prompt_input_ids.input_ids,
    prompt_attention_mask=prompt_input_ids.attention_mask,
)
audio_arr = generation.cpu().numpy().squeeze()
sf.write("indic_tts_out.wav", audio_arr, model.config.sampling_rate)
```
Advanced Usage
🎲 Random voice
Indic Parler-TTS Pretrained provides highly effective control over key aspects of speech synthesis using descriptive captions. The table below summarizes what each control parameter can achieve; a sketch showing how these dimensions can be phrased in a caption follows the table.
| Control Type | Capabilities |
|---|---|
| Background Noise | Adjusts the level of background noise, supporting clear and slightly noisy environments. |
| Reverberation | Controls the perceived distance of the speaker's voice, allowing close or distant sounds. |
| Expressivity | Modulates the emotional intensity of speech, from monotone to highly expressive. |
| Pitch | Varies the pitch to achieve high, low, or moderate tonal output. |
| Speaking Rate | Changes the speed of speech delivery, ranging from slow to fast-paced. |
| Speech Quality | Improves or degrades the overall audio clarity, supporting basic to refined outputs. |
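As an illustration, the sketch below composes a "random voice" caption from the control dimensions in the table: no speaker name is given, so the model picks a voice on its own. The phrasings are examples, not a fixed vocabulary; the resulting string is passed as the description exactly as in the Basic Usage example.

```python
# Composing a caption from the control dimensions in the table above.
# No speaker name is given, so the model chooses a voice; phrasings are illustrative.
controls = {
    "expressivity":     "The speaker is highly expressive,",
    "pitch":            "with a slightly high pitch,",
    "speaking_rate":    "speaking at a fast pace.",
    "background_noise": "The recording has no background noise",
    "reverberation":    "and the voice sounds very close up.",
    "speech_quality":   "The overall audio quality is very high.",
}
description = " ".join(["A female speaker delivers the text."] + list(controls.values()))
# Feed `description` to the description tokenizer as shown in the Basic Usage example.
```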
🌍 Switching languages
The model automatically adapts to the language it detects in the prompt, so you don't need to specify the language explicitly. For example, to switch to Hindi, simply use a Hindi prompt:
```python
import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf

device = "cuda:0" if torch.cuda.is_available() else "cpu"

model = ParlerTTSForConditionalGeneration.from_pretrained("ai4bharat/indic-parler-tts-pretrained").to(device)
tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indic-parler-tts-pretrained")
description_tokenizer = AutoTokenizer.from_pretrained(model.config.text_encoder._name_or_path)

# The transcript is in Hindi; the description can stay in English.
prompt = "अरे, तुम आज कैसे हो?"
description = "A female speaker delivers a slightly expressive and animated speech with a moderate speed and pitch. The recording is of very high quality, with the speaker's voice sounding clear and very close up."

description_input_ids = description_tokenizer(description, return_tensors="pt").to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").to(device)

generation = model.generate(
    input_ids=description_input_ids.input_ids,
    attention_mask=description_input_ids.attention_mask,
    prompt_input_ids=prompt_input_ids.input_ids,
    prompt_attention_mask=prompt_input_ids.attention_mask,
)
audio_arr = generation.cpu().numpy().squeeze()
sf.write("indic_tts_out.wav", audio_arr, model.config.sampling_rate)
```
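Building on the snippet above, and assuming `model`, `tokenizer`, `description_tokenizer`, `device`, and `sf` are already set up as shown, a minimal sketch that reuses one English description while looping over prompts in different languages:

```python
# Reuses the model and tokenizers loaded above: one English description, prompts in several languages.
prompts = {
    "hindi": "अरे, तुम आज कैसे हो?",
    "english": "Hey, how are you doing today?",
}
description = "A female speaker delivers a slightly expressive and animated speech with a moderate speed and pitch. The recording is of very high quality, with the speaker's voice sounding clear and very close up."

description_inputs = description_tokenizer(description, return_tensors="pt").to(device)
for name, prompt in prompts.items():
    prompt_inputs = tokenizer(prompt, return_tensors="pt").to(device)
    generation = model.generate(
        input_ids=description_inputs.input_ids,
        attention_mask=description_inputs.attention_mask,
        prompt_input_ids=prompt_inputs.input_ids,
        prompt_attention_mask=prompt_inputs.attention_mask,
    )
    audio_arr = generation.cpu().numpy().squeeze()
    sf.write(f"indic_tts_{name}.wav", audio_arr, model.config.sampling_rate)
```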
🎯 Using a specific speaker
To ensure speaker consistency across generations, this checkpoint was also trained on pre-determined speakers, characterized by name (e.g. Rohit, Karan, Leela, Maya, Sita, ...). To take advantage of this, simply adapt your text description to specify which speaker to use, e.g. "Divya's voice is monotone yet slightly fast in delivery, with a very close recording that almost has no background noise."
```python
import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf

device = "cuda:0" if torch.cuda.is_available() else "cpu"

model = ParlerTTSForConditionalGeneration.from_pretrained("ai4bharat/indic-parler-tts-pretrained").to(device)
tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indic-parler-tts-pretrained")
description_tokenizer = AutoTokenizer.from_pretrained(model.config.text_encoder._name_or_path)

prompt = "अरे, तुम आज कैसे हो?"
# Naming the speaker (here Divya) in the description keeps the voice consistent across generations.
description = "Divya's voice is monotone yet slightly fast in delivery, with a very close recording that almost has no background noise."

description_input_ids = description_tokenizer(description, return_tensors="pt").to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").to(device)

generation = model.generate(
    input_ids=description_input_ids.input_ids,
    attention_mask=description_input_ids.attention_mask,
    prompt_input_ids=prompt_input_ids.input_ids,
    prompt_attention_mask=prompt_input_ids.attention_mask,
)
audio_arr = generation.cpu().numpy().squeeze()
sf.write("indic_tts_out.wav", audio_arr, model.config.sampling_rate)
```
The model includes 69 speakers across 18 officially supported languages, with each language having a set of recommended voices for optimal performance. Below is a table summarizing the available speakers for each language, along with the recommended ones; a short sketch that generates with the recommended Hindi speakers follows the table.
| Language | Available Speakers | Recommended Speakers |
|---|---|---|
| Assamese | Amit, Sita, Poonam, Rakesh | Amit, Sita |
| Bengali | Arjun, Aditi, Tapan, Rashmi, Arnav, Riya | Arjun, Aditi |
| Bodo | Bikram, Maya, Kalpana | Bikram, Maya |
| Chhattisgarhi | Bhanu, Champa | Bhanu, Champa |
| Dogri | Karan | Karan |
| English | Thoma, Mary, Swapna, Dinesh, Meera, Jatin, Aakash, Sneha, Kabir, Tisha, Chingkhei, Thoiba, Priya, Tarun, Gauri, Nisha, Raghav, Kavya, Ravi, Vikas, Riya | Thoma, Mary |
| Gujarati | Yash, Neha | Yash, Neha |
| Hindi | Rohit, Divya, Aman, Rani | Rohit, Divya |
| Kannada | Suresh, Anu, Chetan, Vidya | Suresh, Anu |
| Malayalam | Anjali, Anju, Harish | Anjali, Harish |
| Manipuri | Laishram, Ranjit | Laishram, Ranjit |
| Marathi | Sanjay, Sunita, Nikhil, Radha, Varun, Isha | Sanjay, Sunita |
| Nepali | Amrita | Amrita |
| Odia | Manas, Debjani | Manas, Debjani |
| Punjabi | Divjot, Gurpreet | Divjot, Gurpreet |
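As a minimal sketch, and assuming `model`, `tokenizer`, `description_tokenizer`, `device`, and `sf` are already set up as in the examples above, the same Hindi prompt can be generated with each recommended Hindi speaker from the table:

```python
# Generate the same prompt with each recommended Hindi speaker from the table above.
prompt = "अरे, तुम आज कैसे हो?"
for speaker in ["Rohit", "Divya"]:  # recommended Hindi speakers
    description = (
        f"{speaker} delivers a slightly expressive speech at a moderate speed and pitch. "
        "The recording is of very high quality, with the voice sounding clear and very close up."
    )
    description_inputs = description_tokenizer(description, return_tensors="pt").to(device)
    prompt_inputs = tokenizer(prompt, return_tensors="pt").to(device)
    generation = model.generate(
        input_ids=description_inputs.input_ids,
        attention_mask=description_inputs.attention_mask,
        prompt_input_ids=prompt_inputs.input_ids,
        prompt_attention_mask=prompt_inputs.attention_mask,
    )
    sf.write(f"indic_tts_{speaker.lower()}.wav", generation.cpu().numpy().squeeze(), model.config.sampling_rate)
```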
📄 License
This project is licensed under the Apache 2.0 license.