Indri models audio as tokens and can generate high-quality audio while maintaining consistent speaker style. It supports voice cloning and code-mixed text input.
Model Features
Small and Lightweight
Based on the GPT-2 medium architecture, compact yet powerful
Ultra-fast Inference
Achieves up to 300 tokens/s generation speed on an RTX 6000 Ada GPU, with time to first token under 20 ms
Voice Cloning
Supports speaker style cloning based on short prompts (<5 seconds)
Multilingual Support
Supports code-mixed input for English and Hindi
Batch Processing
Supports batches of approximately 300 sequences on an RTX 6000 Ada
Model Capabilities
Text-to-speech
Voice Cloning
Multilingual Speech Synthesis
Batch Voice Generation
Use Cases
Content Creation
Audiobook Generation
Automatically generates high-quality audio versions of e-books
Offers multiple speaker style options
Educational Content
Generates multilingual speech content for educational materials
Supports mixed English and Hindi content
Business Applications
Voice Assistants
Integrates natural voice output into applications
Low-latency response
Advertising Content
Quickly generates advertising voices in different styles
Supports multiple speaker styles
🚀 Indri-0.1-350m-tts
Indri is a series of audio models capable of performing TTS, ASR, and audio continuation. This model (350M parameters) is the medium-sized member of the series and supports TTS in English and Hindi with high-quality audio generation.
Use the following code to get started with the model; the pipeline API is the simplest entry point.
import torch
import torchaudio
from transformers import pipeline

model_id = '11mlabs/indri-0.1-350m-tts'
task = 'indri-tts'

pipe = pipeline(
    task,
    model=model_id,
    device=torch.device('cuda:0'),  # update this based on your hardware
    trust_remote_code=True
)

# Generate speech for a list of input texts with the chosen speaker style
output = pipe(['Hi, my name is Indri and I like to talk.'], speaker='[spkr_63]')

# Save the generated waveform; the model produces 24 kHz audio
torchaudio.save('output.wav', output[0]['audio'][0], sample_rate=24000)
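Because the pipeline already accepts a list of texts, batching is the natural extension of the call above. A minimal sketch, assuming the pipeline returns one output dict per input text in order (consistent with the output[0]['audio'][0] indexing above):

# Batched generation: one call, several texts (assumed to return
# one output dict per input, in order)
texts = [
    'Hi, my name is Indri and I like to talk.',
    'Batching several sentences in one call amortizes the per-call overhead.',
]
outputs = pipe(texts, speaker='[spkr_63]')
for i, out in enumerate(outputs):
    torchaudio.save(f'output_{i}.wav', out['audio'][0], sample_rate=24000)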
Small, based on the GPT-2 medium architecture. The methodology can be extended to any autoregressive transformer-based architecture.
Ultra-fast. Using our self-hosted service option, on an RTX 6000 Ada NVIDIA GPU the model can reach speeds of up to 300 tokens/s (3 s of audio generated per second of wall-clock time, i.e. roughly 100 tokens per second of audio) and under 20 ms time to first token.
On an RTX 6000 Ada, it can support a batch size of ~300 sequences at the full context length of 1024 tokens.
Supports voice cloning from short prompts (<5 s).
Code-mixed text input in two languages, English and Hindi; see the example below.
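As a concrete illustration, the code-mixed sample from the Samples section further down can be fed through the same pipeline call shown in the quickstart:

# Code-mixed (English + Hindi) input goes through the same call;
# the text is the code-mixed sample from the Samples section
output = pipe(
    ['Hello दोस्तों, future of speech technology mein अपका स्वागत है'],
    speaker='[spkr_63]'
)
torchaudio.save('code_mixed.wav', output[0]['audio'][0], sample_rate=24000)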
📚 Documentation
Model Details
Model Description
indri-0.1-350m-tts is a novel, ultra-small, lightweight TTS model based on the transformer architecture. It models audio as tokens and can generate high-quality audio while consistently cloning the speaker's style.
Samples
Sample input texts (English glosses in parentheses):
मित्रों, हम आज एक नया छोटा और शक्तिशाली मॉडल रिलीज कर रहे हैं। (Friends, today we are releasing a new, small, and powerful model.)
भाइयों और बहनों, ये हमारा सौभाग्य है कि हम सब मिलकर इस महान देश को नई ऊंचाइयों पर ले जाने का सपना देख रहे हैं। (Brothers and sisters, it is our good fortune that together we dream of taking this great country to new heights.)
Hello दोस्तों, future of speech technology mein अपका स्वागत है (Hello friends, welcome to the future of speech technology.)
In this model zoo, a new model called Indri has appeared.
Details
Model Type: GPT-2-based language model
Size: 350M parameters
Language Support: English, Hindi
License: CC-BY-SA-4.0. This model is not for commercial usage; it is only a research showcase.
🔧 Technical Details
Here's a brief overview of how the model works:
Converts input text into tokens.
Runs autoregressive decoding on the GPT-2-based transformer model to generate audio tokens.
Decodes the audio tokens to a waveform using Kyutai/mimi. (A hedged sketch of these stages follows this list.)
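The sketch below illustrates these three stages. It is an approximation rather than the pipeline's actual internals: map_to_mimi_codes is a hypothetical helper (the real mapping from LM token ids to Mimi codebook entries is model-specific), and the generation settings are illustrative.

# Illustrative sketch of the three stages above, not the pipeline's
# exact internals. `map_to_mimi_codes` is a hypothetical helper.
from transformers import AutoTokenizer, AutoModelForCausalLM, MimiModel

model_id = '11mlabs/indri-0.1-350m-tts'
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
lm = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
mimi = MimiModel.from_pretrained('kyutai/mimi')

# 1. Convert input text (with a speaker token) into tokens.
input_ids = tokenizer('[spkr_63] Hi, my name is Indri.', return_tensors='pt').input_ids

# 2. Autoregressive decoding on the GPT-2 based transformer -> audio tokens.
generated = lm.generate(input_ids, max_new_tokens=512, do_sample=True)
audio_token_ids = generated[:, input_ids.shape[1]:]  # keep only new tokens

# 3. Map LM token ids to Mimi codebook indices and decode to audio.
audio_codes = map_to_mimi_codes(audio_token_ids)  # hypothetical helper
waveform = mimi.decode(audio_codes).audio_values  # 24 kHz waveform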
Please read our blog for more technical details on how it was built.
📄 License
This model is released under the CC-BY-SA-4.0 license and is not for commercial usage. It is only a research showcase.
Citation
If you use this model in your research, please cite:
@misc{indri-multimodal-alm,
  author       = {11mlabs},
  title        = {Indri: Multimodal audio language model},
  year         = {2024},
  publisher    = {GitHub},
  journal      = {GitHub Repository},
  howpublished = {\url{https://github.com/cmeraki/indri}},
  email        = {compute@merakilabs.com}
}
@techreport{kyutai2024moshi,
  title        = {Moshi: a speech-text foundation model for real-time dialogue},
  author       = {Alexandre D\'efossez and Laurent Mazar\'e and Manu Orsini and Am\'elie Royer and Patrick P\'erez and Herv\'e J\'egou and Edouard Grave and Neil Zeghidour},
  year         = {2024},
  eprint       = {2410.00037},
  archivePrefix= {arXiv},
  primaryClass = {eess.AS},
  url          = {https://arxiv.org/abs/2410.00037}
}
@misc{radford2022whisper,
  doi          = {10.48550/ARXIV.2212.04356},
  url          = {https://arxiv.org/abs/2212.04356},
  author       = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  title        = {Robust Speech Recognition via Large-Scale Weak Supervision},
  publisher    = {arXiv},
  year         = {2022},
  copyright    = {arXiv.org perpetual, non-exclusive license}
}