🚀 Massively Multilingual Speech (MMS): Q’eqchi’ Text-to-Speech
This repository offers a text-to-speech (TTS) model checkpoint for the Q’eqchi’ (kek) language. It is part of Facebook's Massively Multilingual Speech (MMS) project, which aims to provide speech technology across a wide range of languages. You can find more details about supported languages and their ISO 639-3 codes in the MMS Language Coverage Overview, and view all MMS-TTS checkpoints on the Hugging Face Hub: facebook/mms-tts.
🚀 Quick Start
MMS-TTS has been available in the 🤗 Transformers library since version 4.33. To use this checkpoint, first install the latest version of the library:
```bash
pip install --upgrade transformers accelerate
```
Then, run inference with the following code snippet:

```python
from transformers import VitsModel, AutoTokenizer
import torch

# Load the Q’eqchi’ checkpoint and its matching tokenizer
model = VitsModel.from_pretrained("facebook/mms-tts-kek")
tokenizer = AutoTokenizer.from_pretrained("facebook/mms-tts-kek")

text = "some example text in the Q’eqchi’ language"
inputs = tokenizer(text, return_tensors="pt")

# Generate the speech waveform (shape: batch x samples)
with torch.no_grad():
    output = model(**inputs).waveform
```
The resulting waveform can be saved as a .wav file:

```python
import scipy.io.wavfile

# scipy expects a NumPy array, so move the tensor to CPU and drop the batch dimension
scipy.io.wavfile.write(
    "techno.wav",
    rate=model.config.sampling_rate,
    data=output.squeeze().cpu().numpy(),
)
```
Or displayed in a Jupyter Notebook / Google Colab:

```python
from IPython.display import Audio

# Play the generated audio inline at the model's sampling rate
Audio(output.squeeze().cpu().numpy(), rate=model.config.sampling_rate)
```
✨ Features
- Multilingual Support: Part of a project aiming to provide speech technology for over 1,000 languages.
- End-to-End Synthesis: The VITS model predicts speech waveforms directly from input text sequences.
- Stochastic Duration Prediction: Allows synthesis of speech with different rhythms from the same input text.
📚 Documentation
Model Details
VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) is an end-to-end speech synthesis model. It predicts a speech waveform based on an input text sequence. It's a conditional variational autoencoder (VAE) consisting of a posterior encoder, decoder, and conditional prior.
A set of spectrogram-based acoustic features is predicted by the flow-based module, which includes a Transformer-based text encoder and multiple coupling layers. The spectrogram is decoded using a stack of transposed convolutional layers, similar to the HiFi-GAN vocoder. Given the one-to-many nature of the TTS problem (the same text can be spoken in multiple ways), the model has a stochastic duration predictor, enabling it to synthesize speech with different rhythms from the same input text.
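These components correspond to submodules of the `VitsModel` implementation in 🤗 Transformers. As a rough orientation aid, the sketch below prints them; the submodule names (`text_encoder`, `flow`, `duration_predictor`, `decoder`) are assumptions based on that implementation and may differ between library versions:

```python
from transformers import VitsModel

model = VitsModel.from_pretrained("facebook/mms-tts-kek")

# Submodule names below are assumptions based on the Transformers VITS
# implementation; verify them against your installed version.
print(type(model.text_encoder).__name__)        # Transformer-based text encoder
print(type(model.flow).__name__)                # flow-based module (coupling layers)
print(type(model.duration_predictor).__name__)  # (stochastic) duration predictor
print(type(model.decoder).__name__)             # HiFi-GAN-style decoder
```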
The model is trained end-to-end using a combination of losses derived from the variational lower bound and from adversarial training. To enhance the model's expressiveness, normalizing flows are applied to the conditional prior distribution. During inference, text encodings are up-sampled according to the duration prediction module and then mapped to the waveform using the flow module and HiFi-GAN decoder. Due to the stochastic nature of the duration predictor, the model is non-deterministic and requires a fixed seed to generate the same speech waveform.
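In practice this means that, for reproducible output, you should fix the random seed before calling the model. Below is a minimal sketch, assuming the `set_seed` helper and the `speaking_rate`/`noise_scale` inference attributes exposed by the Transformers VITS implementation:

```python
import torch
from transformers import VitsModel, AutoTokenizer, set_seed

model = VitsModel.from_pretrained("facebook/mms-tts-kek")
tokenizer = AutoTokenizer.from_pretrained("facebook/mms-tts-kek")
inputs = tokenizer("some example text in the Q’eqchi’ language", return_tensors="pt")

set_seed(555)  # fixed seed -> the stochastic duration predictor becomes reproducible

# Inference-time knobs (attribute names assumed from the Transformers implementation):
model.speaking_rate = 1.0  # >1.0 speaks faster, <1.0 slower
model.noise_scale = 0.667  # lower values reduce variability between runs

with torch.no_grad():
    waveform = model(**inputs).waveform
```

Running the snippet twice with the same seed should yield identical waveforms; changing or omitting the seed yields different rhythms for the same text.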
For the MMS project, a separate VITS checkpoint is trained for each language.
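Switching language therefore only means loading a different checkpoint; the repositories follow the pattern facebook/mms-tts-&lt;iso&gt;, where &lt;iso&gt; is the language's ISO 639-3 code. A minimal sketch, using the English checkpoint facebook/mms-tts-eng as an example (check the MMS Language Coverage Overview for the codes that are actually available):

```python
from transformers import VitsModel, AutoTokenizer

lang_code = "eng"  # ISO 639-3 code of the target language
repo_id = f"facebook/mms-tts-{lang_code}"

model = VitsModel.from_pretrained(repo_id)
tokenizer = AutoTokenizer.from_pretrained(repo_id)
```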
💻 Usage Examples
Basic Usage
```python
from transformers import VitsModel, AutoTokenizer
import torch

model = VitsModel.from_pretrained("facebook/mms-tts-kek")
tokenizer = AutoTokenizer.from_pretrained("facebook/mms-tts-kek")

text = "some example text in the Q’eqchi’ language"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    output = model(**inputs).waveform
```
Saving the Output as a WAV File
```python
import scipy.io.wavfile

scipy.io.wavfile.write(
    "techno.wav",
    rate=model.config.sampling_rate,
    data=output.squeeze().cpu().numpy(),
)
```
Displaying in Jupyter Notebook / Google Colab
```python
from IPython.display import Audio

Audio(output.squeeze().cpu().numpy(), rate=model.config.sampling_rate)
```
📄 BibTeX citation
This model was developed by Vineel Pratap et al. from Meta AI. If you use the model, consider citing the MMS paper:
```bibtex
@article{pratap2023mms,
  title={Scaling Speech Technology to 1,000+ Languages},
  author={Vineel Pratap and Andros Tjandra and Bowen Shi and Paden Tomasello and Arun Babu and Sayani Kundu and Ali Elkahky and Zhaoheng Ni and Apoorv Vyas and Maryam Fazel-Zarandi and Alexei Baevski and Yossi Adi and Xiaohui Zhang and Wei-Ning Hsu and Alexis Conneau and Michael Auli},
  journal={arXiv},
  year={2023}
}
```
📄 License
The model is licensed under CC-BY-NC 4.0.