# Matxa-TTS Catalan Multispeaker
Matxa-TTS is a non-autoregressive model based on Matcha-TTS, designed for fast acoustic modelling in Catalan multispeaker text-to-speech systems, offering high output quality with reduced memory consumption.
## Quick Start
### Installation
This model was trained using the espeak-ng open-source text-to-speech software. The espeak-ng fork containing the Catalan phonemizer is available at https://github.com/projecte-aina/espeak-ng.
Create a virtual environment:

```bash
python -m venv /path/to/venv
source /path/to/venv/bin/activate
```
For training and inference with Catalan Matxa-TTS, you need to compile the provided espeak-ng fork with the Catalan phonemizer:
```bash
# fetch the espeak-ng fork with the Catalan phonemizer
git clone https://github.com/projecte-aina/espeak-ng.git

export PYTHON=/path/to/env/<env_name>/bin/python

# build and install espeak-ng into the chosen prefix
cd /path/to/espeak-ng
./autogen.sh
./configure --prefix=/path/to/espeak-ng
make
make install

# extra Python dependencies
pip cache purge
pip install mecab-python3
pip install unidic-lite
```
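As an optional sanity check (a sketch using the same placeholder install prefix as above; if the binary cannot find its data files, the `ESPEAK_DATA_PATH` export shown in the usage section below also applies here), you can ask the freshly built binary to phonemize a Catalan sentence without producing audio:

```bash
# print the Catalan phoneme mnemonics for a test sentence (-q: no audio, -x: show phonemes)
/path/to/espeak-ng/bin/espeak-ng -v ca -q -x "Bon dia Manel, avui anem a la muntanya."
```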
Clone the repository:

```bash
git clone -b dev-cat https://github.com/langtech-bsc/Matcha-TTS.git
cd Matcha-TTS
```
Install the package from source:

```bash
pip install -e .
```
## Usage Examples
### Basic Usage
End-to-end speech inference can be run with Catalan Matxa-TTS together with the alVoCat vocoder. Both models (Catalan Matxa-TTS and alVoCat) are loaded remotely from the Hugging Face Hub.
First, export the following environment variables so that the locally installed espeak-ng is used:

```bash
export PYTHON=/path/to/your/venv/bin/python
export ESPEAK_DATA_PATH=/path/to/espeak-ng/espeak-ng-data
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/path/to/espeak-ng/lib
export PATH="/path/to/espeak-ng/bin:$PATH"
```
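To confirm that the shell now resolves the compiled espeak-ng rather than a system-wide installation, you can check which binary is found and whether a Catalan voice is listed (an optional check; the exact output depends on your build):

```bash
# should print the binary under /path/to/espeak-ng/bin
which espeak-ng

# lists the available Catalan voices; the Catalan phonemizer build should appear here
espeak-ng --voices=ca
```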
Then you can run the inference script:
```bash
cd Matcha-TTS
python3 matcha_vocos_inference.py --output_path=/output/path --text_input="Bon dia Manel, avui anem a la muntanya."
```
### Advanced Usage
You can also modify the length scale (speech rate) and the temperature of the generated sample:
```bash
python3 matcha_vocos_inference.py --output_path=/output/path --text_input="Bon dia Manel, avui anem a la muntanya." --length_scale=0.8 --temperature=0.7
```
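In Matcha-based models the length scale multiplies the predicted phoneme durations, so values above 1.0 give slower speech and values below 1.0 give faster speech, while a lower temperature reduces the variability of the generated sample. For instance (the values below are only illustrative):

```bash
# slower speech with less sampling variability (illustrative values)
python3 matcha_vocos_inference.py --output_path=/output/path \
    --text_input="Bon dia Manel, avui anem a la muntanya." \
    --length_scale=1.2 --temperature=0.5
```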
### For Training
The full checkpoint is also released so that training or finetuning can be continued; see the repository instructions for details.
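As a rough, non-authoritative sketch (the experiment name and checkpoint path below are placeholders, and the exact Hydra configuration for Catalan Matxa-TTS is documented in the repository), continuing training from the released checkpoint with the upstream Matcha-TTS training entry point would look like:

```bash
# sketch only: run from the Matcha-TTS repository root and replace the placeholders
# with the experiment config and checkpoint path given in the repository instructions
python matcha/train.py experiment=<experiment_config> ckpt_path=/path/to/released_checkpoint.ckpt
```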
## Features
- Fast Acoustic Modelling: Based on Matcha-TTS, an encoder-decoder architecture designed for fast acoustic modelling in TTS.
- Non-autoregressive Model: Trained with optimal-transport conditional flow matching (OT-CFM), capable of generating high output quality in fewer synthesis steps.
- Multispeaker Support: Designed for Catalan multispeaker text-to-speech systems.
## Technical Details
### Training data
The model was trained on 2 Catalan speech datasets:
| Dataset   | Language | Hours | Num. Speakers |
|-----------|----------|-------|---------------|
| Festcat   | ca       | 22    | 11            |
| OpenSLR69 | ca       | 5     | 36            |
### Training procedure
Catalan Matxa-TTS was finetuned from the English multispeaker checkpoint, which was trained on the VCTK dataset and provided by the Matcha-TTS authors. The embedding layer was initialized with the number of Catalan speakers (47), and the original hyperparameters were kept.
### Training Hyperparameters
- batch size: 32 (x2 GPUs)
- learning rate: 1e-4
- number of speakers: 47
- n_fft: 1024
- n_feats: 80
- sample_rate: 22050
- hop_length: 256
- win_length: 1024
- f_min: 0
- f_max: 8000
- data_statistics:
  - mel_mean: -6.578195
  - mel_std: 2.538758
- number of samples: 13340
### Evaluation
Validation values obtained from TensorBoard at epoch 2399*:
- val_dur_loss_epoch: 0.38
- val_prior_loss_epoch: 0.97
- val_diff_loss_epoch: 2.195
*Note that the finetuning started at epoch 1864; earlier epochs correspond to the English model trained on the VCTK dataset.
## License
Apache 2.0
## Documentation
## Citation
If this code contributes to your research, please cite the work:
```bibtex
@misc{mehta2024matchatts,
      title={Matcha-TTS: A fast TTS architecture with conditional flow matching},
      author={Shivam Mehta and Ruibo Tu and Jonas Beskow and Éva Székely and Gustav Eje Henter},
      year={2024},
      eprint={2309.03199},
      archivePrefix={arXiv},
      primaryClass={eess.AS}
}
```
## Additional Information
- Author: The Language Technologies Unit from Barcelona Supercomputing Center.
- Contact: For further information, please send an email to langtech@bsc.es.
- Copyright: Copyright (c) 2023 by Language Technologies Unit, Barcelona Supercomputing Center.
- Funding: This work has been promoted and financed by the Generalitat de Catalunya through the Aina project.