🚀 Massively Multilingual Speech (MMS): Tachelhit Text-to-Speech
This repository offers a text-to-speech (TTS) model checkpoint for the Tachelhit (shi) language. It's part of Facebook's Massively Multilingual Speech project, aiming to provide speech technology across diverse languages.
✨ Features
- Multilingual Support: Part of a project covering over 1000 languages.
- End-to-End Synthesis: Based on the VITS model for direct text-to-speech conversion.
- Stochastic Variation: Allows for different speech rhythms from the same input text.
📦 Installation
MMS-TTS is available in the 🤗 Transformers library from version 4.33 onwards. To use this checkpoint, first install the latest version of the library:
pip install --upgrade transformers accelerate
💻 Usage Examples
Basic Usage
from transformers import VitsModel, AutoTokenizer
import torch
model = VitsModel.from_pretrained("facebook/mms-tts-shi")
tokenizer = AutoTokenizer.from_pretrained("facebook/mms-tts-shi")
text = "some example text in the Tachelhit language"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    output = model(**inputs).waveform
Advanced Usage
Save the resulting waveform as a .wav file:
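import scipy.io.wavfile
# scipy expects a NumPy array, so drop the batch dimension and convert the tensor
scipy.io.wavfile.write("techno.wav", rate=model.config.sampling_rate, data=output.squeeze().numpy())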
Or display it in a Jupyter Notebook / Google Colab:
from IPython.display import Audio
Audio(output.numpy(), rate=model.config.sampling_rate)
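Some audio players do not handle floating-point WAV files well. As a minimal sketch, assuming the waveform is a float array with values roughly in [-1.0, 1.0], the samples can be converted to 16-bit PCM before saving. The helper name `to_int16_pcm` is hypothetical and not part of Transformers or SciPy:

```python
import numpy as np

def to_int16_pcm(waveform):
    """Convert a float waveform (assumed to lie in [-1.0, 1.0]) to 16-bit PCM.

    Hypothetical helper for illustration only.
    """
    # Drop singleton dimensions such as the batch axis, e.g. (1, samples) -> (samples,)
    samples = np.asarray(waveform, dtype=np.float32).squeeze()
    # Clip so that out-of-range values saturate instead of wrapping around
    samples = np.clip(samples, -1.0, 1.0)
    # Scale to the int16 range; the cast truncates toward zero
    return (samples * 32767.0).astype(np.int16)
```

The result can then be passed as the `data` argument when writing the file, for example `data=to_int16_pcm(output.numpy())`.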
📚 Documentation
Model Details
VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) is an end-to-end speech synthesis model. It predicts a speech waveform based on an input text sequence. It's a conditional variational autoencoder (VAE) with a posterior encoder, decoder, and conditional prior.
A flow-based module, composed of a Transformer-based text encoder and multiple coupling layers, predicts spectrogram-based acoustic features. The spectrogram is decoded using transposed convolutional layers, similar to the HiFi-GAN vocoder. The model also includes a stochastic duration predictor to handle the one-to-many nature of TTS, enabling different speech rhythms from the same input.
The model is trained end-to-end with combined losses from variational lower bound and adversarial training. Normalizing flows are applied to the conditional prior distribution. During inference, text encodings are up-sampled and mapped to the waveform. Due to the stochastic duration predictor, a fixed seed is needed for consistent results.
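Because the duration predictor samples random noise, two runs on the same text can produce different waveforms. The sketch below shows one way to make a run repeatable by seeding PyTorch before the forward pass. The helper `synthesize` is hypothetical; it assumes `model` and `tokenizer` were loaded as in the basic usage example, and that the model exposes `speaking_rate` and `noise_scale` attributes for controlling speed and variation:

```python
import torch

def synthesize(model, tokenizer, text, seed=555, speaking_rate=None, noise_scale=None):
    """Run one seeded synthesis pass (hypothetical helper, for illustration)."""
    # Optionally adjust speed and randomness; these attribute names are an
    # assumption about the loaded VITS model object.
    if speaking_rate is not None:
        model.speaking_rate = speaking_rate
    if noise_scale is not None:
        model.noise_scale = noise_scale
    # Seed the global PyTorch RNG so the sampled noise is the same on every call
    torch.manual_seed(seed)
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).waveform
```

Calling `synthesize(model, tokenizer, text, seed=555)` twice should return the same waveform on the same hardware and library versions, while changing the seed should give a different rendition of the same text.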
For the MMS project, a separate VITS checkpoint is trained for each language.
📄 License
The model is licensed as CC-BY-NC 4.0.
BibTeX citation
This model was developed by Vineel Pratap et al. from Meta AI. If you use the model, consider citing the MMS paper:
@article{pratap2023mms,
title={Scaling Speech Technology to 1,000+ Languages},
author={Vineel Pratap and Andros Tjandra and Bowen Shi and Paden Tomasello and Arun Babu and Sayani Kundu and Ali Elkahky and Zhaoheng Ni and Apoorv Vyas and Maryam Fazel-Zarandi and Alexei Baevski and Yossi Adi and Xiaohui Zhang and Wei-Ning Hsu and Alexis Conneau and Michael Auli},
journal={arXiv},
year={2023}
}