🚀 NVIDIA HiFiGAN Vocoder (en-US)
HiFiGAN is a generative adversarial network (GAN) model that generates audio from mel spectrograms. Its generator uses transposed convolutions to upsample mel spectrograms to audio waveforms, making it a powerful vocoder for text-to-speech tasks.
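For intuition, here is a minimal, hypothetical sketch of transposed-convolution upsampling in PyTorch. The layer sizes and strides are invented for illustration only and do not match the released HiFiGAN checkpoint.

```python
import torch
import torch.nn as nn

class ToyUpsampler(nn.Module):
    """Toy illustration of mel-to-waveform upsampling; not the real HiFiGAN."""
    def __init__(self, n_mels=80):
        super().__init__()
        self.net = nn.Sequential(
            # Each ConvTranspose1d upsamples the time axis by its stride (8x here).
            nn.ConvTranspose1d(n_mels, 128, kernel_size=16, stride=8, padding=4),
            nn.LeakyReLU(0.1),
            nn.ConvTranspose1d(128, 64, kernel_size=16, stride=8, padding=4),
            nn.LeakyReLU(0.1),
            nn.Conv1d(64, 1, kernel_size=7, padding=3),
            nn.Tanh(),  # waveform samples in [-1, 1]
        )

    def forward(self, mel):          # mel: [batch, n_mels, frames]
        return self.net(mel)         # audio: [batch, 1, frames * 64]

mel = torch.randn(1, 80, 100)        # 100 mel frames
audio = ToyUpsampler()(mel)
print(audio.shape)                   # torch.Size([1, 1, 6400])
```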
🚀 Quick Start
The model is available for use in the NeMo toolkit [3] and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset. To work with the model, you need to install NVIDIA NeMo; we recommend installing it after installing the latest version of PyTorch.
```bash
git clone https://github.com/NVIDIA/NeMo
cd NeMo
BRANCH='main'
python -m pip install "git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]"
```
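To verify the installation, a quick import check (assuming a standard NeMo install) can be run:

```python
# Sanity check that NeMo and its TTS collection imported correctly.
import nemo
from nemo.collections.tts.models import HifiGanModel  # noqa: F401

print(nemo.__version__)
```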
✨ Features
- Audio Generation: HiFiGAN can generate audio from mel spectrograms.
- Versatile Usage: It can be used as a pre-trained checkpoint for inference or fine-tuning in the NeMo toolkit.
- Multispeaker Support: The associated models support generating multispeaker English voices with American and UK accents.
💻 Usage Examples
Basic Usage
```python
from huggingface_hub import hf_hub_download
from nemo.collections.tts.models import FastPitchModel, HifiGanModel

# Download and restore the FastPitch multispeaker spectrogram generator.
REPO_ID = "Mastering-Python-HF/nvidia_tts_en_fastpitch_multispeaker"
FILENAME = "tts_en_fastpitch_multispeaker.nemo"
path = hf_hub_download(repo_id=REPO_ID, filename=FILENAME)
spec_generator = FastPitchModel.restore_from(restore_path=path)

# Download and restore the matching HiFiGAN vocoder.
REPO_ID = "Mastering-Python-HF/nvidia_tts_en_hifitts_hifigan_ft_fastpitch"
FILENAME = "tts_en_hifitts_hifigan_ft_fastpitch.nemo"
path = hf_hub_download(repo_id=REPO_ID, filename=FILENAME)
model = HifiGanModel.restore_from(restore_path=path)
```
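Optionally, before generating speech, both models can be switched to inference mode and moved to a GPU if one is available. NeMo models are standard PyTorch modules, so the usual calls apply:

```python
import torch

# Pick a device and put both models into eval mode for inference.
device = "cuda" if torch.cuda.is_available() else "cpu"
spec_generator = spec_generator.to(device).eval()
model = model.to(device).eval()
```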
Advanced Usage
```python
import soundfile as sf

# Tokenize the input text.
parsed = spec_generator.parse("You can type your sentence here to get nemo to produce speech.")

# Available speaker IDs in the multispeaker checkpoint:
#   92    Cori Samuel
#   6097  Phil Benson
#   9017  John Van Stan
#   6670  Mike Pelton
#   6671  Tony Oliva
#   8051  Maria Kasper
#   9136  Helen Taylor
#   11614 Sylviamb
#   11697 Celine Major
#   12787 LikeManyWaters

# Generate a mel spectrogram for the chosen speaker, then vocode it to audio.
spectrogram = spec_generator.generate_spectrogram(tokens=parsed, speaker=92)
audio = model.convert_spectrogram_to_audio(spec=spectrogram)

# Save the waveform as a 44100 Hz WAV file.
sf.write("speech.wav", audio.to('cpu').detach().numpy()[0], 44100)
```
📚 Documentation
Input
This model accepts batches of mel spectrograms (produced here from batches of text by the FastPitch spectrogram generator).
Output
This model generates audio.
Model Architecture
FastPitch multispeaker is a fully parallel text-to-speech model based on FastSpeech, conditioned on fundamental frequency contours. The model predicts pitch contours during inference. By altering these predictions, the generated speech can be made more expressive, better matched to the semantics of the utterance, and ultimately more engaging to the listener. FastPitch is based on a fully parallel Transformer architecture, with a much higher real-time factor than Tacotron 2 for mel-spectrogram synthesis of a typical utterance. It uses an unsupervised speech-text aligner.
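As a small illustration of this controllability, the sketch below varies the speaking pace of the same sentence. The `pace` keyword is an assumption based on NeMo's FastPitch inference API (values below 1.0 are expected to slow speech down); explicit pitch editing would require lower-level access to the model's forward pass.

```python
# Hedged sketch: render one sentence at several paces. The `pace` argument
# is assumed from NeMo's FastPitchModel.generate_spectrogram API; pitch
# contours themselves are predicted internally by the model.
import soundfile as sf

parsed = spec_generator.parse("Pitch and pace make speech more engaging.")
for pace in (0.85, 1.0, 1.15):
    spectrogram = spec_generator.generate_spectrogram(tokens=parsed, speaker=92, pace=pace)
    audio = model.convert_spectrogram_to_audio(spec=spectrogram)
    sf.write(f"speech_pace_{pace}.wav", audio.to('cpu').detach().numpy()[0], 44100)
```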
Training
The NeMo toolkit [3] was used to train the models for 1000 epochs.
Datasets
This model was trained on HiFiTTS, sampled at 44100 Hz, and has been tested on generating multispeaker English voices with American and UK accents.
Performance
No performance information is available at this time.
Limitations
This checkpoint only works well with vocoders that were trained on 44100 Hz data. Otherwise, the generated audio may sound scratchy or choppy.
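As a guard against that failure mode, here is a tentative sketch that compares both model configs before synthesis. The `cfg.sample_rate` key is an assumption about NeMo's model configs and may not be present in every checkpoint; adjust it to your model's actual config keys.

```python
# Hedged sketch: check for a spectrogram/vocoder sample-rate mismatch.
# `cfg.get("sample_rate")` is assumed; some checkpoints may store it elsewhere.
sr_spec = spec_generator.cfg.get("sample_rate", None)
sr_voc = model.cfg.get("sample_rate", None)
if sr_spec is not None and sr_voc is not None and sr_spec != sr_voc:
    raise ValueError(f"Sample-rate mismatch: FastPitch {sr_spec} Hz vs HiFiGAN {sr_voc} Hz")
```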
References
[3] NVIDIA NeMo Toolkit: https://github.com/NVIDIA/NeMo
Colab example