Indri is a series of audio models that can perform TTS, ASR, and audio continuation. This is the smallest model (124M) in the series, and it supports TTS in two languages:

- English
- Hindi
## Model Details

### Model Description
`indri-0.1-124m-tts` is a novel, ultra-small, lightweight TTS model based on the transformer architecture. It models audio as discrete tokens and can generate high-quality audio while consistently cloning the style of the prompt speaker.
## Samples

| Text | English translation | Sample |
|---|---|---|
| मित्रों, हम आज एक नया छोटा और शक्तिशाली मॉडल रिलीज कर रहे हैं। | Friends, today we are releasing a new, small, and powerful model. | (audio sample) |
| भाइयों और बहनों, ये हमारा सौभाग्य है कि हम सब मिलकर इस महान देश को नई ऊंचाइयों पर ले जाने का सपना देख रहे हैं। | Brothers and sisters, it is our good fortune that together we dream of taking this great country to new heights. | (audio sample) |
| Hello दोस्तों, future of speech technology mein आपका स्वागत है | Hello friends, welcome to the future of speech technology. | (audio sample) |
| In this model zoo, a new model called Indri has appeared. | (already English) | (audio sample) |
## Key features

- Extremely small: based on the GPT-2 small architecture. The methodology can be extended to any autoregressive transformer-based architecture.
- Ultra-fast: using our self-hosted service option on an NVIDIA RTX 6000 Ada GPU, the model can reach up to 400 tokens/s (roughly 4 seconds of audio generated per second of wall-clock time) with under 20 ms time to first token.
- High throughput: on the RTX 6000 Ada, it can support a batch size of ~1000 sequences at the full context length of 1024 tokens.
- Supports voice cloning from short prompts (<5 s).
- Accepts code-mixed text input in two languages: English and Hindi.
## Details

- Model type: GPT-2 based language model
- Size: 124M parameters
- Language support: English, Hindi
- License: This model is not for commercial usage; it is only a research showcase.
## Technical details

Here's a brief overview of how the model works:

1. Converts the input text into tokens.
2. Runs autoregressive decoding on the GPT-2 based transformer model, generating audio tokens.
3. Decodes the audio tokens back into a waveform with a neural audio codec (a sketch of this flow is shown below).
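The following is a minimal, illustrative sketch of that three-step flow, not the model's actual API: `text_tokenizer`, `tts_model`, and `audio_codec` are hypothetical placeholders, and the 🤗 pipeline in the next section wraps all of these steps for you.

```python
import torch

def synthesize(text: str, text_tokenizer, tts_model, audio_codec) -> torch.Tensor:
    # 1. Convert the input text into token ids.
    text_tokens = text_tokenizer.encode(text, return_tensors='pt')

    # 2. Autoregressively generate audio tokens with the GPT-2 based model.
    audio_tokens = tts_model.generate(text_tokens)

    # 3. Decode the audio tokens back into a waveform with the audio codec.
    waveform = audio_codec.decode(audio_tokens)
    return waveform
```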
Please read our blog post for more technical details on how the model was built.
## How to Get Started with the Model

### 🤗 pipelines

Pipelines are the easiest way to get started with the model. Use the code below:
```python
import torch
import torchaudio
from transformers import pipeline

model_id = '11mlabs/indri-0.1-124m-tts'
task = 'indri-tts'

pipe = pipeline(
    task,
    model=model_id,
    device=torch.device('cuda:0'),  # Update this based on your hardware
    trust_remote_code=True
)

# The pipeline takes a list of texts and a speaker id, and returns one
# result per input text.
output = pipe(['Hi, my name is Indri and I like to talk.'], speaker='[spkr_63]')

# Save the generated waveform; the model produces 24 kHz audio.
torchaudio.save('output.wav', output[0]['audio'][0], sample_rate=24000)
```
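Because the pipeline accepts a list of texts, several sentences can be synthesized in one call. A small sketch, assuming the returned list mirrors the input order (the filenames are arbitrary):

```python
texts = [
    'Hi, my name is Indri and I like to talk.',
    'Hello दोस्तों, future of speech technology mein आपका स्वागत है',
]
outputs = pipe(texts, speaker='[spkr_63]')

# Save one wav file per input text.
for i, out in enumerate(outputs):
    torchaudio.save(f'output_{i}.wav', out['audio'][0], sample_rate=24000)
```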
## Citation

```bibtex
@techreport{kyutai2024moshi,
  title={Moshi: a speech-text foundation model for real-time dialogue},
  author={Alexandre D\'efossez and Laurent Mazar\'e and Manu Orsini and Am\'elie Royer and Patrick P\'erez and Herv\'e J\'egou and Edouard Grave and Neil Zeghidour},
  year={2024},
  eprint={2410.00037},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  url={https://arxiv.org/abs/2410.00037},
}

@misc{radford2022whisper,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  year={2022},
  doi={10.48550/ARXIV.2212.04356},
  url={https://arxiv.org/abs/2212.04356},
  publisher={arXiv},
  copyright={arXiv.org perpetual, non-exclusive license}
}
```