# Matxa-TTS Catalan Multispeaker
Matxa-TTS is a non-autoregressive model based on Matcha-TTS, designed for fast acoustic modelling in Catalan multispeaker text-to-speech systems, offering high output quality with reduced memory consumption.
## Quick Start
### Installation
This model was trained using the espeak-ng open-source text-to-speech software. The espeak-ng fork containing the Catalan phonemizer is available at https://github.com/projecte-aina/espeak-ng.
Create a virtual environment:

```bash
python -m venv /path/to/venv
source /path/to/venv/bin/activate
```
For training and inference with Catalan Matxa-TTS, you need to compile the provided espeak-ng fork with the Catalan phonemizer:
```bash
# fetch the espeak-ng fork with the Catalan phonemizer
git clone https://github.com/projecte-aina/espeak-ng.git

export PYTHON=/path/to/env/<env_name>/bin/python

# build and install espeak-ng into the chosen prefix
cd /path/to/espeak-ng
./autogen.sh
./configure --prefix=/path/to/espeak-ng
make
make install

# extra Python dependencies
pip cache purge
pip install mecab-python3
pip install unidic-lite
```
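As an optional sanity check (a sketch using the same placeholder install prefix as above; if the binary cannot find its data files, the `ESPEAK_DATA_PATH` export shown in the usage section below also applies here), you can ask the freshly built binary to phonemize a Catalan sentence without producing audio:

```bash
# print the Catalan phoneme mnemonics for a test sentence (-q: no audio, -x: show phonemes)
/path/to/espeak-ng/bin/espeak-ng -v ca -q -x "Bon dia Manel, avui anem a la muntanya."
```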
Clone the repository:

```bash
git clone -b dev-cat https://github.com/langtech-bsc/Matcha-TTS.git
cd Matcha-TTS
```
Install the package from source:

```bash
pip install -e .
```
## Usage Examples
### Basic Usage
End-to-end speech inference can be run with Catalan Matxa-TTS together with the alVoCat vocoder. Both models (Catalan Matxa-TTS and alVoCat) are loaded remotely from the Hugging Face Hub.
First, export the following environment variables so that the locally installed espeak-ng is used:

```bash
export PYTHON=/path/to/your/venv/bin/python
export ESPEAK_DATA_PATH=/path/to/espeak-ng/espeak-ng-data
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/path/to/espeak-ng/lib
export PATH="/path/to/espeak-ng/bin:$PATH"
```
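To confirm that the shell now resolves the compiled espeak-ng rather than a system-wide installation, you can check which binary is found and whether a Catalan voice is listed (an optional check; the exact output depends on your build):

```bash
# should print the binary under /path/to/espeak-ng/bin
which espeak-ng

# lists the available Catalan voices; the Catalan phonemizer build should appear here
espeak-ng --voices=ca
```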
Then you can run the inference script:
```bash
cd Matcha-TTS
python3 matcha_vocos_inference.py --output_path=/output/path --text_input="Bon dia Manel, avui anem a la muntanya."
```
### Advanced Usage
You can also modify the length scale (speech rate) and the temperature of the generated sample:
```bash
python3 matcha_vocos_inference.py --output_path=/output/path --text_input="Bon dia Manel, avui anem a la muntanya." --length_scale=0.8 --temperature=0.7
```
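In Matcha-based models the length scale multiplies the predicted phoneme durations, so values above 1.0 give slower speech and values below 1.0 give faster speech, while a lower temperature reduces the variability of the generated sample. For instance (the values below are only illustrative):

```bash
# slower speech with less sampling variability (illustrative values)
python3 matcha_vocos_inference.py --output_path=/output/path \
    --text_input="Bon dia Manel, avui anem a la muntanya." \
    --length_scale=1.2 --temperature=0.5
```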
### For Training
The full checkpoint is also released so that training or finetuning can be continued; see the repository instructions for details.
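As a rough, non-authoritative sketch (the experiment name and checkpoint path below are placeholders, and the exact Hydra configuration for Catalan Matxa-TTS is documented in the repository), continuing training from the released checkpoint with the upstream Matcha-TTS training entry point would look like:

```bash
# sketch only: run from the Matcha-TTS repository root and replace the placeholders
# with the experiment config and checkpoint path given in the repository instructions
python matcha/train.py experiment=<experiment_config> ckpt_path=/path/to/released_checkpoint.ckpt
```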
## Features
- Fast Acoustic Modelling: Based on Matcha-TTS, an encoder-decoder architecture designed for fast acoustic modelling in TTS.
- Non-autoregressive Model: Trained with optimal-transport conditional flow matching (OT-CFM), capable of generating high output quality in fewer synthesis steps.
- Multispeaker Support: Designed for Catalan multispeaker text-to-speech systems.
## Technical Details
### Training data
The model was trained on 2 Catalan speech datasets:
| Dataset   | Language | Hours | Num. Speakers |
|-----------|----------|-------|---------------|
| Festcat   | ca       | 22    | 11            |
| OpenSLR69 | ca       | 5     | 36            |
### Training procedure
Catalan Matxa-TTS was finetuned from the English multispeaker checkpoint, which was trained on the VCTK dataset and provided by the Matcha-TTS authors. The embedding layer was initialized with the number of Catalan speakers (47), and the original hyperparameters were kept.
### Training Hyperparameters
- batch size: 32 (x2 GPUs)
- learning rate: 1e-4
- number of speakers: 47
- n_fft: 1024
- n_feats: 80
- sample_rate: 22050
- hop_length: 256
- win_length: 1024
- f_min: 0
- f_max: 8000
- data_statistics:
  - mel_mean: -6.578195
  - mel_std: 2.538758
- number of samples: 13340
### Evaluation
Validation values obtained from TensorBoard at epoch 2399*:
- val_dur_loss_epoch: 0.38
- val_prior_loss_epoch: 0.97
- val_diff_loss_epoch: 2.195
*Note that the finetuning started at epoch 1864; earlier epochs correspond to the English model trained on the VCTK dataset.
## License
Apache 2.0
## Documentation
## Citation
If this code contributes to your research, please cite the work:
```bibtex
@misc{mehta2024matchatts,
      title={Matcha-TTS: A fast TTS architecture with conditional flow matching},
      author={Shivam Mehta and Ruibo Tu and Jonas Beskow and Éva Székely and Gustav Eje Henter},
      year={2024},
      eprint={2309.03199},
      archivePrefix={arXiv},
      primaryClass={eess.AS}
}
```
## Additional Information
- Author: The Language Technologies Unit from Barcelona Supercomputing Center.
- Contact: For further information, please send an email to langtech@bsc.es.
- Copyright: Copyright (c) 2023 by Language Technologies Unit, Barcelona Supercomputing Center.
- Funding: This work has been promoted and financed by the Generalitat de Catalunya through the Aina project.