MMS-TTS-GBM: Open-Source Text-to-Speech Model for High-Quality Speech Synthesis in Garhwali

MMS-TTS-GBM

Developed by facebook

Garhwali text-to-speech model developed by Meta, supporting high-quality speech synthesis

Speech Synthesis

Transformers

#Low-resource language TTS #End-to-end speech synthesis #Variational autoencoder

Release Time: 9/1/2023

Model Overview

This model is part of Meta's Massively Multilingual Speech (MMS) project and is designed specifically for Garhwali text-to-speech synthesis. It uses the VITS architecture to generate high-quality speech.

Model Features

Part of a multilingual project

Part of Meta's Massively Multilingual Speech project, which trains a separate checkpoint for each supported language; this checkpoint covers Garhwali

High-quality speech synthesis

Uses the VITS architecture, which combines a variational lower-bound loss with adversarial training to generate high-quality speech waveforms

Enhanced expressiveness

Applies normalizing flows to the conditional prior distribution to enhance speech expressiveness

Non-deterministic output

A stochastic duration predictor lets the model synthesize different rhythms from the same text

Model Capabilities

Text-to-speech

Garhwali speech synthesis

High-quality waveform generation

Use Cases

Speech technology applications

Garhwali voice assistant

Developing voice assistant applications for Garhwali users

Provides natural and fluent voice interaction experience

Educational applications

Used for speech synthesis in Garhwali learning materials

Helps learners obtain standard pronunciation examples

🚀 Massively Multilingual Speech (MMS): Garhwali Text-to-Speech

This repository offers a text-to-speech (TTS) model checkpoint for the Garhwali (gbm) language. It's part of Facebook's Massively Multilingual Speech project, aiming to provide speech technology across diverse languages.

🚀 Quick Start

MMS-TTS is available in the 🤗 Transformers library from version 4.33 onwards. To use this checkpoint, follow these steps:

First, install the latest version of the library:

pip install --upgrade transformers accelerate

Then, run inference with the following code:

from transformers import VitsModel, AutoTokenizer
import torch

model = VitsModel.from_pretrained("facebook/mms-tts-gbm")
tokenizer = AutoTokenizer.from_pretrained("facebook/mms-tts-gbm")

text = "some example text in the Garhwali language"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    output = model(**inputs).waveform

The resulting waveform can be saved as a .wav file:

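import scipy.io.wavfile

# output has shape (batch, samples); write the first (and only) utterance as a 1-D NumPy array
scipy.io.wavfile.write("techno.wav", rate=model.config.sampling_rate, data=output[0].numpy())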
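
If scipy is not available, or a player does not accept floating-point WAV files, the waveform can also be written as 16-bit PCM with the Python standard library. The following is a minimal sketch, not part of the original model card: `write_wav_int16` is a hypothetical helper, the samples are assumed to be mono floats in [-1.0, 1.0], and the 16000 Hz rate in the demo is an assumption (in practice, pass `model.config.sampling_rate`).

```python
import math
import os
import struct
import tempfile
import wave


def write_wav_int16(path, samples, sampling_rate):
    """Write mono float samples (assumed to lie in [-1.0, 1.0]) as 16-bit PCM."""
    frames = bytearray()
    for sample in samples:
        # Clip defensively, then scale to the signed 16-bit range.
        clipped = max(-1.0, min(1.0, float(sample)))
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)        # mono
        wav_file.setsampwidth(2)        # 2 bytes = 16 bits per sample
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(bytes(frames))


# Demo with a synthetic one-second 440 Hz tone standing in for model output.
demo_rate = 16000  # assumed rate for the demo only
tone = [0.5 * math.sin(2 * math.pi * 440 * n / demo_rate) for n in range(demo_rate)]
demo_path = os.path.join(tempfile.gettempdir(), "tone.wav")
write_wav_int16(demo_path, tone, demo_rate)
```

With the model output from the inference step, the call would look like `write_wav_int16("techno.wav", output[0].tolist(), model.config.sampling_rate)`.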

Or displayed in a Jupyter Notebook / Google Colab:

from IPython.display import Audio

# Convert the torch tensor to a NumPy array before handing it to the audio widget
Audio(output.numpy(), rate=model.config.sampling_rate)

✨ Features

Multilingual Support: Part of the MMS project, aiming to cover over 1000 languages.
End-to-End Synthesis: Based on the VITS model for direct text-to-speech conversion.
Stochastic Duration Prediction: Allows for different speech rhythms from the same input text.

📚 Documentation

Model Details

VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) is an end-to-end speech synthesis model. It predicts a speech waveform based on an input text sequence. It's a conditional variational autoencoder (VAE) with a posterior encoder, decoder, and conditional prior.

A flow-based module, consisting of a Transformer-based text encoder and multiple coupling layers, predicts a set of spectrogram-based acoustic features. The spectrogram is decoded using transposed convolutional layers, similar to the HiFi-GAN vocoder.

Due to the one-to-many nature of the TTS problem, the model includes a stochastic duration predictor, enabling it to synthesize speech with different rhythms from the same input text.
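
The effect of a stochastic duration predictor can be illustrated with a toy sketch. This is not the VITS implementation: `sample_durations` and `upsample` are hypothetical helpers, the noise model is a plain Gaussian on log-durations, and the parameter values are arbitrary. It only shows how sampled per-token durations stretch the same token sequence into frame sequences of different lengths.

```python
import math
import random


def sample_durations(num_tokens, rng, log_mean=1.2, noise_scale=0.5):
    """Toy stand-in for a stochastic duration predictor (hypothetical).

    Draws a noisy log-duration per token and converts it to a whole
    number of frames, so each token lasts at least one frame.
    """
    durations = []
    for _ in range(num_tokens):
        log_duration = log_mean + noise_scale * rng.gauss(0.0, 1.0)
        durations.append(math.ceil(math.exp(log_duration)))
    return durations


def upsample(encodings, durations):
    """Repeat each token encoding for its sampled number of frames."""
    frames = []
    for encoding, duration in zip(encodings, durations):
        frames.extend([encoding] * duration)
    return frames


tokens = ["g", "a", "r", "h"]  # placeholder token encodings

first = sample_durations(len(tokens), random.Random(0))
other = sample_durations(len(tokens), random.Random(1))   # different noise
repeat = sample_durations(len(tokens), random.Random(0))  # same seed as `first`

print(first, len(upsample(tokens, first)))
print(other, len(upsample(tokens, other)))
```

Different seeds will usually yield different duration lists, and therefore a different rhythm for the same text, while reusing a seed reproduces the same durations.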

The model is trained end-to-end with a combination of losses from variational lower bound and adversarial training. Normalizing flows are applied to the conditional prior distribution to enhance the model's expressiveness.
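
For reference, the variational lower bound and the flow-transformed prior can be written as follows. The model card gives no formulas, so the notation here is an assumption that follows the usual conditional VAE conventions: x is the speech signal, c the text condition, z the latent variable, q_φ the posterior encoder, p_θ the decoder and prior, and f_θ the invertible normalizing flow. The adversarial terms are omitted.

```latex
\begin{align}
% Evidence lower bound on the conditional log-likelihood of speech x given text c
\log p_\theta(x \mid c) &\ge
  \mathbb{E}_{q_\phi(z \mid x)}\!\left[
    \log p_\theta(x \mid z) - \log \frac{q_\phi(z \mid x)}{p_\theta(z \mid c)}
  \right] \\
% Conditional prior: a Gaussian evaluated on f_\theta(z), with the change-of-variables term
p_\theta(z \mid c) &=
  \mathcal{N}\!\left(f_\theta(z);\, \mu_\theta(c),\, \sigma_\theta(c)\right)
  \left| \det \frac{\partial f_\theta(z)}{\partial z} \right|
\end{align}
```

The first term of the bound acts as a reconstruction loss, and the second is the KL divergence between the posterior and the prior; the flow makes the prior more flexible than a plain Gaussian.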

During inference, text encodings are upsampled based on the duration prediction module and then mapped to the waveform using the flow module and HiFi-GAN decoder. Since the duration predictor is stochastic, the model is non-deterministic and requires a fixed seed to generate the same speech waveform.
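
One way to fix the seed is the `set_seed` utility from 🤗 Transformers, called right before the forward pass. The sketch below assumes that approach; `load_tts` and `synthesize` are hypothetical helper names, and the heavy imports are placed inside the functions so that defining them does not require loading torch or downloading the checkpoint.

```python
def load_tts(checkpoint="facebook/mms-tts-gbm"):
    """Load the VITS model and its tokenizer once (downloads on first use)."""
    from transformers import AutoTokenizer, VitsModel

    model = VitsModel.from_pretrained(checkpoint)
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    return model, tokenizer


def synthesize(model, tokenizer, text, seed):
    """Return a 1-D waveform tensor for `text`, with the RNGs seeded first."""
    import torch
    from transformers import set_seed

    inputs = tokenizer(text, return_tensors="pt")
    # Seed Python, NumPy and PyTorch so the stochastic duration
    # predictor draws the same noise on every call with this seed.
    set_seed(seed)
    with torch.no_grad():
        return model(**inputs).waveform[0]
```

Calling `synthesize` twice with the same text and seed should then return the same waveform, while changing the seed is expected to change the rhythm of the result.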

For the MMS project, a separate VITS checkpoint is trained for each language.

📄 License

The model is licensed as CC-BY-NC 4.0.

BibTeX citation

This model was developed by Vineel Pratap et al. from Meta AI. If you use the model, consider citing the MMS paper:

@article{pratap2023mms,
    title={Scaling Speech Technology to 1,000+ Languages},
    author={Vineel Pratap and Andros Tjandra and Bowen Shi and Paden Tomasello and Arun Babu and Sayani Kundu and Ali Elkahky and Zhaoheng Ni and Apoorv Vyas and Maryam Fazel-Zarandi and Alexei Baevski and Yossi Adi and Xiaohui Zhang and Wei-Ning Hsu and Alexis Conneau and Michael Auli},
    journal={arXiv},
    year={2023}
}
