Indri models audio as tokens and can generate high-quality audio while maintaining consistent speaker style. It supports voice cloning and code-mixed text input.
Model Features
Small and Lightweight
Based on the GPT-2 medium architecture, compact yet powerful
Ultra-fast Inference
Achieves up to 300 tokens/s generation speed on an RTX 6000 Ada GPU, with time to first token under 20 ms
Voice Cloning
Supports speaker style cloning based on short prompts (<5 seconds)
Multilingual Support
Supports code-mixed input for English and Hindi
Batch Processing
Supports batches of approximately 300 sequences on an RTX 6000 Ada
Model Capabilities
Text-to-speech
Voice Cloning
Multilingual Speech Synthesis
Batch Voice Generation
Use Cases
Content Creation
Audiobook Generation
Automatically generates high-quality audio versions of e-books
Offers multiple speaker style options
Educational Content
Generates multilingual speech content for educational materials
Supports mixed English and Hindi content
Business Applications
Voice Assistants
Integrates natural voice output into applications
Low-latency response
Advertising Content
Quickly generates advertising voices in different styles
Supports multiple speaker styles
🚀 Indri-0.1-350m-tts
Indri is a series of audio models capable of performing TTS, ASR, and audio continuation. This model (350M parameters) is the medium-sized member of the series and supports TTS in English and Hindi with high-quality audio generation.
Use the following code to get started with the model; the pipeline API is the simplest entry point.
import torch
import torchaudio
from transformers import pipeline

model_id = '11mlabs/indri-0.1-350m-tts'
task = 'indri-tts'

pipe = pipeline(
    task,
    model=model_id,
    device=torch.device('cuda:0'),  # update this based on your hardware
    trust_remote_code=True
)

# Generate speech for a list of input texts with the chosen speaker style
output = pipe(['Hi, my name is Indri and I like to talk.'], speaker='[spkr_63]')

# Save the generated waveform; the model produces 24 kHz audio
torchaudio.save('output.wav', output[0]['audio'][0], sample_rate=24000)
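Because the pipeline already accepts a list of texts, batching is the natural extension of the call above. A minimal sketch, assuming the pipeline returns one output dict per input text in order (consistent with the output[0]['audio'][0] indexing above):

# Batched generation: one call, several texts (assumed to return
# one output dict per input, in order)
texts = [
    'Hi, my name is Indri and I like to talk.',
    'Batching several sentences in one call amortizes the per-call overhead.',
]
outputs = pipe(texts, speaker='[spkr_63]')
for i, out in enumerate(outputs):
    torchaudio.save(f'output_{i}.wav', out['audio'][0], sample_rate=24000)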
Small, based on the GPT-2 medium architecture. The methodology can be extended to any autoregressive transformer-based architecture.
Ultra-fast. Using our self-hosted service option, on an RTX 6000 Ada NVIDIA GPU the model can reach speeds of up to 300 tokens/s (3 s of audio generated per second of wall-clock time, i.e. roughly 100 tokens per second of audio) and under 20 ms time to first token.
On an RTX 6000 Ada, it can support a batch size of ~300 sequences at the full context length of 1024 tokens.
Supports voice cloning from short prompts (<5 s).
Code-mixed text input in two languages, English and Hindi; see the example below.
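As a concrete illustration, the code-mixed sample from the Samples section further down can be fed through the same pipeline call shown in the quickstart:

# Code-mixed (English + Hindi) input goes through the same call;
# the text is the code-mixed sample from the Samples section
output = pipe(
    ['Hello दोस्तों, future of speech technology mein अपका स्वागत है'],
    speaker='[spkr_63]'
)
torchaudio.save('code_mixed.wav', output[0]['audio'][0], sample_rate=24000)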
📚 Documentation
Model Details
Model Description
indri-0.1-350m-tts is a novel, ultra-small, lightweight TTS model based on the transformer architecture. It models audio as tokens and can generate high-quality audio while consistently cloning the speaker's style.
Samples
Sample input texts (English glosses in parentheses):
मित्रों, हम आज एक नया छोटा और शक्तिशाली मॉडल रिलीज कर रहे हैं। (Friends, today we are releasing a new, small, and powerful model.)
भाइयों और बहनों, ये हमारा सौभाग्य है कि हम सब मिलकर इस महान देश को नई ऊंचाइयों पर ले जाने का सपना देख रहे हैं। (Brothers and sisters, it is our good fortune that together we dream of taking this great country to new heights.)
Hello दोस्तों, future of speech technology mein अपका स्वागत है (Hello friends, welcome to the future of speech technology.)
In this model zoo, a new model called Indri has appeared.
Details
Model Type: GPT-2-based language model
Size: 350M parameters
Language Support: English, Hindi
License: CC-BY-SA-4.0. This model is not for commercial usage; it is only a research showcase.
🔧 Technical Details
Here's a brief overview of how the model works:
Converts input text into tokens.
Runs autoregressive decoding on the GPT-2-based transformer model to generate audio tokens.
Decodes the audio tokens to a waveform using Kyutai/mimi. (A hedged sketch of these stages follows this list.)
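The sketch below illustrates these three stages. It is an approximation rather than the pipeline's actual internals: map_to_mimi_codes is a hypothetical helper (the real mapping from LM token ids to Mimi codebook entries is model-specific), and the generation settings are illustrative.

# Illustrative sketch of the three stages above, not the pipeline's
# exact internals. `map_to_mimi_codes` is a hypothetical helper.
from transformers import AutoTokenizer, AutoModelForCausalLM, MimiModel

model_id = '11mlabs/indri-0.1-350m-tts'
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
lm = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
mimi = MimiModel.from_pretrained('kyutai/mimi')

# 1. Convert input text (with a speaker token) into tokens.
input_ids = tokenizer('[spkr_63] Hi, my name is Indri.', return_tensors='pt').input_ids

# 2. Autoregressive decoding on the GPT-2 based transformer -> audio tokens.
generated = lm.generate(input_ids, max_new_tokens=512, do_sample=True)
audio_token_ids = generated[:, input_ids.shape[1]:]  # keep only new tokens

# 3. Map LM token ids to Mimi codebook indices and decode to audio.
audio_codes = map_to_mimi_codes(audio_token_ids)  # hypothetical helper
waveform = mimi.decode(audio_codes).audio_values  # 24 kHz waveform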
Please read our blog for more technical details on how it was built.
📄 License
This model is released under the CC-BY-SA-4.0 license and is not for commercial usage. It is only a research showcase.
Citation
If you use this model in your research, please cite:
@misc{indri-multimodal-alm,
  author       = {11mlabs},
  title        = {Indri: Multimodal audio language model},
  year         = {2024},
  publisher    = {GitHub},
  journal      = {GitHub Repository},
  howpublished = {\url{https://github.com/cmeraki/indri}},
  email        = {compute@merakilabs.com}
}
@techreport{kyutai2024moshi,
  title        = {Moshi: a speech-text foundation model for real-time dialogue},
  author       = {Alexandre D\'efossez and Laurent Mazar\'e and Manu Orsini and Am\'elie Royer and Patrick P\'erez and Herv\'e J\'egou and Edouard Grave and Neil Zeghidour},
  year         = {2024},
  eprint       = {2410.00037},
  archivePrefix= {arXiv},
  primaryClass = {eess.AS},
  url          = {https://arxiv.org/abs/2410.00037}
}
@misc{radford2022whisper,
  doi          = {10.48550/ARXIV.2212.04356},
  url          = {https://arxiv.org/abs/2212.04356},
  author       = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  title        = {Robust Speech Recognition via Large-Scale Weak Supervision},
  publisher    = {arXiv},
  year         = {2022},
  copyright    = {arXiv.org perpetual, non-exclusive license}
}