🚀 NVIDIA HiFiGAN Vocoder (en-US)
HiFiGAN is a generative adversarial network (GAN) model that generates audio from mel spectrograms. Its generator uses transposed convolutions to upsample mel spectrograms to audio waveforms, making it a powerful vocoder for text-to-speech tasks.
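For intuition, here is a minimal, hypothetical sketch of transposed-convolution upsampling in PyTorch. The layer sizes and strides are invented for illustration only and do not match the released HiFiGAN checkpoint.

```python
import torch
import torch.nn as nn

class ToyUpsampler(nn.Module):
    """Toy illustration of mel-to-waveform upsampling; not the real HiFiGAN."""
    def __init__(self, n_mels=80):
        super().__init__()
        self.net = nn.Sequential(
            # Each ConvTranspose1d upsamples the time axis by its stride (8x here).
            nn.ConvTranspose1d(n_mels, 128, kernel_size=16, stride=8, padding=4),
            nn.LeakyReLU(0.1),
            nn.ConvTranspose1d(128, 64, kernel_size=16, stride=8, padding=4),
            nn.LeakyReLU(0.1),
            nn.Conv1d(64, 1, kernel_size=7, padding=3),
            nn.Tanh(),  # waveform samples in [-1, 1]
        )

    def forward(self, mel):          # mel: [batch, n_mels, frames]
        return self.net(mel)         # audio: [batch, 1, frames * 64]

mel = torch.randn(1, 80, 100)        # 100 mel frames
audio = ToyUpsampler()(mel)
print(audio.shape)                   # torch.Size([1, 1, 6400])
```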
🚀 Quick Start
The model is available for use in the NeMo toolkit [3] and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset. To work with the model, you need to install NVIDIA NeMo; we recommend installing it after installing the latest version of PyTorch.
```bash
git clone https://github.com/NVIDIA/NeMo
cd NeMo
BRANCH='main'
python -m pip install "git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]"
```
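To verify the installation, a quick import check (assuming a standard NeMo install) can be run:

```python
# Sanity check that NeMo and its TTS collection imported correctly.
import nemo
from nemo.collections.tts.models import HifiGanModel  # noqa: F401

print(nemo.__version__)
```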
✨ Features
- Audio Generation: HiFiGAN can generate audio from mel spectrograms.
- Versatile Usage: It can be used as a pre-trained checkpoint for inference or fine-tuning in the NeMo toolkit.
- Multispeaker Support: The associated models support generating multispeaker English voices with American and UK accents.
💻 Usage Examples
Basic Usage
```python
from huggingface_hub import hf_hub_download
from nemo.collections.tts.models import FastPitchModel, HifiGanModel

# Download and restore the FastPitch multispeaker spectrogram generator.
REPO_ID = "Mastering-Python-HF/nvidia_tts_en_fastpitch_multispeaker"
FILENAME = "tts_en_fastpitch_multispeaker.nemo"
path = hf_hub_download(repo_id=REPO_ID, filename=FILENAME)
spec_generator = FastPitchModel.restore_from(restore_path=path)

# Download and restore the matching HiFiGAN vocoder.
REPO_ID = "Mastering-Python-HF/nvidia_tts_en_hifitts_hifigan_ft_fastpitch"
FILENAME = "tts_en_hifitts_hifigan_ft_fastpitch.nemo"
path = hf_hub_download(repo_id=REPO_ID, filename=FILENAME)
model = HifiGanModel.restore_from(restore_path=path)
```
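Optionally, before generating speech, both models can be switched to inference mode and moved to a GPU if one is available. NeMo models are standard PyTorch modules, so the usual calls apply:

```python
import torch

# Pick a device and put both models into eval mode for inference.
device = "cuda" if torch.cuda.is_available() else "cpu"
spec_generator = spec_generator.to(device).eval()
model = model.to(device).eval()
```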
Advanced Usage
```python
import soundfile as sf

# Tokenize the input text.
parsed = spec_generator.parse("You can type your sentence here to get nemo to produce speech.")

# Available speaker IDs in the multispeaker checkpoint:
#   92    Cori Samuel
#   6097  Phil Benson
#   9017  John Van Stan
#   6670  Mike Pelton
#   6671  Tony Oliva
#   8051  Maria Kasper
#   9136  Helen Taylor
#   11614 Sylviamb
#   11697 Celine Major
#   12787 LikeManyWaters

# Generate a mel spectrogram for the chosen speaker, then vocode it to audio.
spectrogram = spec_generator.generate_spectrogram(tokens=parsed, speaker=92)
audio = model.convert_spectrogram_to_audio(spec=spectrogram)

# Save the waveform as a 44100 Hz WAV file.
sf.write("speech.wav", audio.to('cpu').detach().numpy()[0], 44100)
```
📚 Documentation
Input
This model accepts batches of mel spectrograms (produced here from batches of text by the FastPitch spectrogram generator).
Output
This model generates audio.
Model Architecture
FastPitch multispeaker is a fully parallel text-to-speech model based on FastSpeech, conditioned on fundamental frequency contours. The model predicts pitch contours during inference. By altering these predictions, the generated speech can be made more expressive, better matched to the semantics of the utterance, and ultimately more engaging to the listener. FastPitch is based on a fully parallel Transformer architecture, with a much higher real-time factor than Tacotron 2 for mel-spectrogram synthesis of a typical utterance. It uses an unsupervised speech-text aligner.
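As a small illustration of this controllability, the sketch below varies the speaking pace of the same sentence. The `pace` keyword is an assumption based on NeMo's FastPitch inference API (values below 1.0 are expected to slow speech down); explicit pitch editing would require lower-level access to the model's forward pass.

```python
# Hedged sketch: render one sentence at several paces. The `pace` argument
# is assumed from NeMo's FastPitchModel.generate_spectrogram API; pitch
# contours themselves are predicted internally by the model.
import soundfile as sf

parsed = spec_generator.parse("Pitch and pace make speech more engaging.")
for pace in (0.85, 1.0, 1.15):
    spectrogram = spec_generator.generate_spectrogram(tokens=parsed, speaker=92, pace=pace)
    audio = model.convert_spectrogram_to_audio(spec=spectrogram)
    sf.write(f"speech_pace_{pace}.wav", audio.to('cpu').detach().numpy()[0], 44100)
```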
Training
The NeMo toolkit [3] was used to train the models for 1000 epochs.
Datasets
This model was trained on HiFiTTS, sampled at 44100 Hz, and has been tested on generating multispeaker English voices with American and UK accents.
Performance
No performance information is available at this time.
Limitations
This checkpoint only works well with vocoders that were trained on 44100 Hz data. Otherwise, the generated audio may sound scratchy or choppy.
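As a guard against that failure mode, here is a tentative sketch that compares both model configs before synthesis. The `cfg.sample_rate` key is an assumption about NeMo's model configs and may not be present in every checkpoint; adjust it to your model's actual config keys.

```python
# Hedged sketch: check for a spectrogram/vocoder sample-rate mismatch.
# `cfg.get("sample_rate")` is assumed; some checkpoints may store it elsewhere.
sr_spec = spec_generator.cfg.get("sample_rate", None)
sr_voc = model.cfg.get("sample_rate", None)
if sr_spec is not None and sr_voc is not None and sr_spec != sr_voc:
    raise ValueError(f"Sample-rate mismatch: FastPitch {sr_spec} Hz vs HiFiGAN {sr_voc} Hz")
```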
References
[3] NVIDIA NeMo Toolkit: https://github.com/NVIDIA/NeMo
Colab example