🚀 Indic Parler-TTS Pretrained
Indic Parler-TTS Pretrained is a multilingual Indic extension of Parler-TTS Mini. It was trained on an 8,385-hour multilingual Indic and English dataset and is released alongside its fine-tuned version: Indic Parler-TTS. This model can officially speak in 21 languages, including 20 Indic languages and English, and thanks to its better prompt tokenizer, it can easily be extended to other languages.
🚀 Quick Start
👨‍💻 Installation
Using Parler-TTS is as simple as "bonjour". Simply install the library once:
```sh
pip install git+https://github.com/huggingface/parler-tts.git
```
✨ Features
🛠️ Key capabilities
The model accepts two primary inputs:
- Transcript - The text to be converted to speech.
- Caption - A detailed description of how the speech should sound, e.g., "Leela speaks in a high-pitched, fast-paced, and cheerful tone, full of energy and happiness. The recording is very high quality with no background noise."
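Both inputs are plain strings, but as the usage examples below show, each one is handled by its own tokenizer: the caption goes through the description tokenizer of the text encoder, while the transcript goes through the model's prompt tokenizer. A minimal sketch of the two inputs:

```python
# The two inputs are plain strings; each is fed to its own tokenizer
# (the full generation pipeline is shown in the Basic Usage example below).
prompt = "Hey, how are you doing today?"  # transcript: the text to be spoken
description = (  # caption: how the speech should sound
    "Leela speaks in a high-pitched, fast-paced, and cheerful tone, full of energy "
    "and happiness. The recording is very high quality with no background noise."
)
```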
Key Features
- Language Support
- Officially supported languages: Assamese, Bengali, Bodo, Dogri, Kannada, Malayalam, Marathi, Sanskrit, Nepali, English, Telugu, Hindi, Gujarati, Konkani, Maithili, Manipuri, Odia, Santali, Sindhi, Tamil, and Urdu.
- Unofficial support: Chhattisgarhi, Kashmiri, Punjabi.
- Speaker Diversity
- 69 unique voices across the supported languages.
- Supported languages have a set of recommended voices optimized for naturalness and intelligibility.
- Emotion Rendering
- 10 languages officially support emotion-specific prompts: Assamese, Bengali, Bodo, Dogri, Kannada, Malayalam, Marathi, Sanskrit, Nepali, and Tamil.
- Emotion support for other languages exists but has not been extensively tested.
- Available emotions include: Command, Anger, Narration, Conversation, Disgust, Fear, Happy, Neutral, Proper Noun, News, Sad, and Surprise.
- Accent Flexibility
- The model officially supports Indian English accents through its English voices, providing clear and natural speech.
- For other accents, the model allows customization by specifying accent details, such as "A male British speaker" or "A female American speaker," using style transfer for more dynamic and personalized outputs.
- Customizable Output
Indic Parler-TTS Pretrained offers precise control over various speech characteristics using the caption input (example captions follow this list):
- Background Noise: Adjust the noise level in the audio, from clear to slightly noisy environments.
- Reverberation: Control the perceived distance of the voice, from close-sounding to distant-sounding speech.
- Expressivity: Specify how dynamic or monotone the speech should be, ranging from expressive to slightly expressive or monotone.
- Pitch: Modify the pitch of the speech, including high, low, or balanced tones.
- Speaking Rate: Change the speaking rate, from slow to fast.
- Voice Quality: Control the overall clarity and naturalness of the speech, adjusting from basic to refined voice quality.
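To make these features concrete, here are a few illustrative caption strings. The first is the example from above; the others combine a speaker name, an emotion, or an accent with recording attributes. The phrasing is free-form rather than a fixed syntax.

```python
# Illustrative captions only; the caption field accepts free-form descriptions.
example_captions = [
    # Named speaker + emotion (see the emotion list above)
    "Leela speaks in a high-pitched, fast-paced, and cheerful tone, full of energy "
    "and happiness. The recording is very high quality with no background noise.",
    # Accent control via a style description
    "A male speaker with a British accent delivers the text in a calm, measured voice.",
    # Background noise, reverberation, and speaking-rate control
    "A female speaker talks slowly in a monotone voice. The recording is slightly "
    "noisy and sounds distant, as if captured in a large room.",
]
```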
💻 Usage Examples
Basic Usage
```python
import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf

device = "cuda:0" if torch.cuda.is_available() else "cpu"

model = ParlerTTSForConditionalGeneration.from_pretrained("ai4bharat/indic-parler-tts-pretrained").to(device)
tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indic-parler-tts-pretrained")
description_tokenizer = AutoTokenizer.from_pretrained(model.config.text_encoder._name_or_path)

prompt = "Hey, how are you doing today?"
description = "A female speaker with a British accent delivers a slightly expressive and animated speech with a moderate speed and pitch. The recording is of very high quality, with the speaker's voice sounding clear and very close up."

# The caption and the transcript are tokenized by different tokenizers.
description_input_ids = description_tokenizer(description, return_tensors="pt").to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").to(device)

generation = model.generate(
    input_ids=description_input_ids.input_ids,
    attention_mask=description_input_ids.attention_mask,
    prompt_input_ids=prompt_input_ids.input_ids,
    prompt_attention_mask=prompt_input_ids.attention_mask,
)
audio_arr = generation.cpu().numpy().squeeze()
sf.write("indic_tts_out.wav", audio_arr, model.config.sampling_rate)
```
Advanced Usage
🎲 Random voice
Indic Parler-TTS Pretrained provides highly effective control over key aspects of speech synthesis using descriptive captions. The table below summarizes what each control parameter can achieve; a sketch showing how these dimensions can be phrased in a caption follows the table.
| Control Type | Capabilities |
|---|---|
| Background Noise | Adjusts the level of background noise, supporting clear and slightly noisy environments. |
| Reverberation | Controls the perceived distance of the speaker's voice, allowing close or distant sounds. |
| Expressivity | Modulates the emotional intensity of speech, from monotone to highly expressive. |
| Pitch | Varies the pitch to achieve high, low, or moderate tonal output. |
| Speaking Rate | Changes the speed of speech delivery, ranging from slow to fast-paced. |
| Speech Quality | Improves or degrades the overall audio clarity, supporting basic to refined outputs. |
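As an illustration, the sketch below composes a "random voice" caption from the control dimensions in the table: no speaker name is given, so the model picks a voice on its own. The phrasings are examples, not a fixed vocabulary; the resulting string is passed as the description exactly as in the Basic Usage example.

```python
# Composing a caption from the control dimensions in the table above.
# No speaker name is given, so the model chooses a voice; phrasings are illustrative.
controls = {
    "expressivity":     "The speaker is highly expressive,",
    "pitch":            "with a slightly high pitch,",
    "speaking_rate":    "speaking at a fast pace.",
    "background_noise": "The recording has no background noise",
    "reverberation":    "and the voice sounds very close up.",
    "speech_quality":   "The overall audio quality is very high.",
}
description = " ".join(["A female speaker delivers the text."] + list(controls.values()))
# Feed `description` to the description tokenizer as shown in the Basic Usage example.
```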
🌍 Switching languages
The model automatically adapts to the language it detects in the prompt, so you don't need to specify the language explicitly. For example, to switch to Hindi, simply use a Hindi prompt:
```python
import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf

device = "cuda:0" if torch.cuda.is_available() else "cpu"

model = ParlerTTSForConditionalGeneration.from_pretrained("ai4bharat/indic-parler-tts-pretrained").to(device)
tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indic-parler-tts-pretrained")
description_tokenizer = AutoTokenizer.from_pretrained(model.config.text_encoder._name_or_path)

# The transcript is in Hindi; the description can stay in English.
prompt = "अरे, तुम आज कैसे हो?"
description = "A female speaker delivers a slightly expressive and animated speech with a moderate speed and pitch. The recording is of very high quality, with the speaker's voice sounding clear and very close up."

description_input_ids = description_tokenizer(description, return_tensors="pt").to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").to(device)

generation = model.generate(
    input_ids=description_input_ids.input_ids,
    attention_mask=description_input_ids.attention_mask,
    prompt_input_ids=prompt_input_ids.input_ids,
    prompt_attention_mask=prompt_input_ids.attention_mask,
)
audio_arr = generation.cpu().numpy().squeeze()
sf.write("indic_tts_out.wav", audio_arr, model.config.sampling_rate)
```
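Building on the snippet above, and assuming `model`, `tokenizer`, `description_tokenizer`, `device`, and `sf` are already set up as shown, a minimal sketch that reuses one English description while looping over prompts in different languages:

```python
# Reuses the model and tokenizers loaded above: one English description, prompts in several languages.
prompts = {
    "hindi": "अरे, तुम आज कैसे हो?",
    "english": "Hey, how are you doing today?",
}
description = "A female speaker delivers a slightly expressive and animated speech with a moderate speed and pitch. The recording is of very high quality, with the speaker's voice sounding clear and very close up."

description_inputs = description_tokenizer(description, return_tensors="pt").to(device)
for name, prompt in prompts.items():
    prompt_inputs = tokenizer(prompt, return_tensors="pt").to(device)
    generation = model.generate(
        input_ids=description_inputs.input_ids,
        attention_mask=description_inputs.attention_mask,
        prompt_input_ids=prompt_inputs.input_ids,
        prompt_attention_mask=prompt_inputs.attention_mask,
    )
    audio_arr = generation.cpu().numpy().squeeze()
    sf.write(f"indic_tts_{name}.wav", audio_arr, model.config.sampling_rate)
```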
🎯 Using a specific speaker
To ensure speaker consistency across generations, this checkpoint was also trained on pre-determined speakers, characterized by name (e.g. Rohit, Karan, Leela, Maya, Sita, ...). To take advantage of this, simply adapt your text description to specify which speaker to use, e.g. "Divya's voice is monotone yet slightly fast in delivery, with a very close recording that almost has no background noise."
```python
import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf

device = "cuda:0" if torch.cuda.is_available() else "cpu"

model = ParlerTTSForConditionalGeneration.from_pretrained("ai4bharat/indic-parler-tts-pretrained").to(device)
tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indic-parler-tts-pretrained")
description_tokenizer = AutoTokenizer.from_pretrained(model.config.text_encoder._name_or_path)

prompt = "अरे, तुम आज कैसे हो?"
# Naming the speaker (here Divya) in the description keeps the voice consistent across generations.
description = "Divya's voice is monotone yet slightly fast in delivery, with a very close recording that almost has no background noise."

description_input_ids = description_tokenizer(description, return_tensors="pt").to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").to(device)

generation = model.generate(
    input_ids=description_input_ids.input_ids,
    attention_mask=description_input_ids.attention_mask,
    prompt_input_ids=prompt_input_ids.input_ids,
    prompt_attention_mask=prompt_input_ids.attention_mask,
)
audio_arr = generation.cpu().numpy().squeeze()
sf.write("indic_tts_out.wav", audio_arr, model.config.sampling_rate)
```
The model includes 69 speakers across 18 officially supported languages, with each language having a set of recommended voices for optimal performance. Below is a table summarizing the available speakers for each language, along with the recommended ones; a short sketch that generates with the recommended Hindi speakers follows the table.
| Language | Available Speakers | Recommended Speakers |
|---|---|---|
| Assamese | Amit, Sita, Poonam, Rakesh | Amit, Sita |
| Bengali | Arjun, Aditi, Tapan, Rashmi, Arnav, Riya | Arjun, Aditi |
| Bodo | Bikram, Maya, Kalpana | Bikram, Maya |
| Chhattisgarhi | Bhanu, Champa | Bhanu, Champa |
| Dogri | Karan | Karan |
| English | Thoma, Mary, Swapna, Dinesh, Meera, Jatin, Aakash, Sneha, Kabir, Tisha, Chingkhei, Thoiba, Priya, Tarun, Gauri, Nisha, Raghav, Kavya, Ravi, Vikas, Riya | Thoma, Mary |
| Gujarati | Yash, Neha | Yash, Neha |
| Hindi | Rohit, Divya, Aman, Rani | Rohit, Divya |
| Kannada | Suresh, Anu, Chetan, Vidya | Suresh, Anu |
| Malayalam | Anjali, Anju, Harish | Anjali, Harish |
| Manipuri | Laishram, Ranjit | Laishram, Ranjit |
| Marathi | Sanjay, Sunita, Nikhil, Radha, Varun, Isha | Sanjay, Sunita |
| Nepali | Amrita | Amrita |
| Odia | Manas, Debjani | Manas, Debjani |
| Punjabi | Divjot, Gurpreet | Divjot, Gurpreet |
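As a minimal sketch, and assuming `model`, `tokenizer`, `description_tokenizer`, `device`, and `sf` are already set up as in the examples above, the same Hindi prompt can be generated with each recommended Hindi speaker from the table:

```python
# Generate the same prompt with each recommended Hindi speaker from the table above.
prompt = "अरे, तुम आज कैसे हो?"
for speaker in ["Rohit", "Divya"]:  # recommended Hindi speakers
    description = (
        f"{speaker} delivers a slightly expressive speech at a moderate speed and pitch. "
        "The recording is of very high quality, with the voice sounding clear and very close up."
    )
    description_inputs = description_tokenizer(description, return_tensors="pt").to(device)
    prompt_inputs = tokenizer(prompt, return_tensors="pt").to(device)
    generation = model.generate(
        input_ids=description_inputs.input_ids,
        attention_mask=description_inputs.attention_mask,
        prompt_input_ids=prompt_inputs.input_ids,
        prompt_attention_mask=prompt_inputs.attention_mask,
    )
    sf.write(f"indic_tts_{speaker.lower()}.wav", generation.cpu().numpy().squeeze(), model.config.sampling_rate)
```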
📄 License
This project is licensed under the Apache 2.0 license.