Indri is a series of audio models that can perform TTS, ASR, and audio continuation. This is the smallest model (124M) in the series, and it supports TTS in two languages:

- English
- Hindi
## Model Details

### Model Description
`indri-0.1-124m-tts` is a novel, ultra-small, lightweight TTS model based on the transformer architecture. It models audio as discrete tokens and can generate high-quality audio while consistently cloning the style of the prompt speaker.
## Samples

| Text | English translation | Sample |
|---|---|---|
| मित्रों, हम आज एक नया छोटा और शक्तिशाली मॉडल रिलीज कर रहे हैं। | Friends, today we are releasing a new, small, and powerful model. | (audio sample) |
| भाइयों और बहनों, ये हमारा सौभाग्य है कि हम सब मिलकर इस महान देश को नई ऊंचाइयों पर ले जाने का सपना देख रहे हैं। | Brothers and sisters, it is our good fortune that together we dream of taking this great country to new heights. | (audio sample) |
| Hello दोस्तों, future of speech technology mein आपका स्वागत है | Hello friends, welcome to the future of speech technology. | (audio sample) |
| In this model zoo, a new model called Indri has appeared. | (already English) | (audio sample) |
## Key features

- Extremely small: based on the GPT-2 small architecture. The methodology can be extended to any autoregressive transformer-based architecture.
- Ultra-fast: using our self-hosted service option on an NVIDIA RTX 6000 Ada GPU, the model can reach up to 400 tokens/s (roughly 4 seconds of audio generated per second of wall-clock time) with under 20 ms time to first token.
- High throughput: on the RTX 6000 Ada, it can support a batch size of ~1000 sequences at the full context length of 1024 tokens.
- Supports voice cloning from short prompts (<5 s).
- Accepts code-mixed text input in two languages: English and Hindi.
## Details

- Model type: GPT-2 based language model
- Size: 124M parameters
- Language support: English, Hindi
- License: This model is not for commercial usage; it is only a research showcase.
## Technical details

Here's a brief overview of how the model works:

1. Converts the input text into tokens.
2. Runs autoregressive decoding on the GPT-2 based transformer model, generating audio tokens.
3. Decodes the audio tokens back into a waveform with a neural audio codec (a sketch of this flow is shown below).
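The following is a minimal, illustrative sketch of that three-step flow, not the model's actual API: `text_tokenizer`, `tts_model`, and `audio_codec` are hypothetical placeholders, and the 🤗 pipeline in the next section wraps all of these steps for you.

```python
import torch

def synthesize(text: str, text_tokenizer, tts_model, audio_codec) -> torch.Tensor:
    # 1. Convert the input text into token ids.
    text_tokens = text_tokenizer.encode(text, return_tensors='pt')

    # 2. Autoregressively generate audio tokens with the GPT-2 based model.
    audio_tokens = tts_model.generate(text_tokens)

    # 3. Decode the audio tokens back into a waveform with the audio codec.
    waveform = audio_codec.decode(audio_tokens)
    return waveform
```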
Please read our blog post for more technical details on how the model was built.
## How to Get Started with the Model

### 🤗 pipelines

Pipelines are the easiest way to get started with the model. Use the code below:
```python
import torch
import torchaudio
from transformers import pipeline

model_id = '11mlabs/indri-0.1-124m-tts'
task = 'indri-tts'

pipe = pipeline(
    task,
    model=model_id,
    device=torch.device('cuda:0'),  # Update this based on your hardware
    trust_remote_code=True
)

# The pipeline takes a list of texts and a speaker id, and returns one
# result per input text.
output = pipe(['Hi, my name is Indri and I like to talk.'], speaker='[spkr_63]')

# Save the generated waveform; the model produces 24 kHz audio.
torchaudio.save('output.wav', output[0]['audio'][0], sample_rate=24000)
```
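Because the pipeline accepts a list of texts, several sentences can be synthesized in one call. A small sketch, assuming the returned list mirrors the input order (the filenames are arbitrary):

```python
texts = [
    'Hi, my name is Indri and I like to talk.',
    'Hello दोस्तों, future of speech technology mein आपका स्वागत है',
]
outputs = pipe(texts, speaker='[spkr_63]')

# Save one wav file per input text.
for i, out in enumerate(outputs):
    torchaudio.save(f'output_{i}.wav', out['audio'][0], sample_rate=24000)
```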
## Citation

```bibtex
@techreport{kyutai2024moshi,
  title={Moshi: a speech-text foundation model for real-time dialogue},
  author={Alexandre D\'efossez and Laurent Mazar\'e and Manu Orsini and Am\'elie Royer and Patrick P\'erez and Herv\'e J\'egou and Edouard Grave and Neil Zeghidour},
  year={2024},
  eprint={2410.00037},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  url={https://arxiv.org/abs/2410.00037},
}

@misc{radford2022whisper,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  year={2022},
  doi={10.48550/ARXIV.2212.04356},
  url={https://arxiv.org/abs/2212.04356},
  publisher={arXiv},
  copyright={arXiv.org perpetual, non-exclusive license}
}
```