🚀 NVIDIA HiFi-GAN Vocoder (en-US)
HiFi-GAN is a generative adversarial network (GAN) that generates audio waveforms from mel spectrograms. It provides high-quality speech synthesis and is compatible with NVIDIA Riva, enabling efficient deployment in various scenarios.
🚀 Quick Start
The model is available in the NeMo toolkit and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset. To use it, install NVIDIA NeMo after installing the latest PyTorch version.

```shell
pip install nemo_toolkit['all']
```
✨ Features
- Audio Generation: Generate high-quality audio from mel spectrograms.
- NeMo Compatibility: Can be used in the NeMo toolkit for training, fine-tuning, and inference.
- Riva Compatibility: Compatible with NVIDIA Riva for efficient deployment.
📦 Installation
To train, fine-tune, or experiment with the model, install NVIDIA NeMo. We recommend installing it after installing the latest PyTorch version.

```shell
pip install nemo_toolkit['all']
```
💻 Usage Examples
Basic Usage
NOTE: To generate audio, you also need a spectrogram generator from NeMo. This example uses the FastPitch model.
```python
# Load the spectrogram generator (FastPitch) and the vocoder (HiFi-GAN)
from nemo.collections.tts.models import FastPitchModel
from nemo.collections.tts.models import HifiGanModel

spec_generator = FastPitchModel.from_pretrained("nvidia/tts_en_fastpitch")
model = HifiGanModel.from_pretrained(model_name="nvidia/tts_hifigan")
```
Generate Audio
```python
import soundfile as sf

# Tokenize the input text, generate a mel spectrogram, then vocode it to audio
parsed = spec_generator.parse("You can type your sentence here to get nemo to produce speech.")
spectrogram = spec_generator.generate_spectrogram(tokens=parsed)
audio = model.convert_spectrogram_to_audio(spec=spectrogram)
```
Save the Generated Audio File
```python
# The audio tensor is batched with shape (1, num_samples); detach it from the
# computation graph and take the first item before writing the WAV file
sf.write("speech.wav", audio.to('cpu').detach().numpy()[0], 22050)
```
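As a rough sanity check on the generated audio, you can estimate its duration from the spectrogram length. This sketch assumes a hop length of 256 samples at 22050 Hz, the typical NeMo TTS configuration for this model family; neither value is stated in the card above.

```python
# Hypothetical sanity check: estimate output duration from spectrogram frames.
# hop_length = 256 and sample_rate = 22050 are assumed NeMo defaults,
# not values stated in this card.
HOP_LENGTH = 256
SAMPLE_RATE = 22050

def estimated_duration_seconds(num_frames: int) -> float:
    """Each mel frame expands to HOP_LENGTH audio samples in the vocoder."""
    return num_frames * HOP_LENGTH / SAMPLE_RATE

# A 258-frame spectrogram yields roughly 258 * 256 = 66048 samples,
# i.e. about 3 seconds of audio at 22050 Hz.
print(estimated_duration_seconds(258))
```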
📚 Documentation
Input
This model accepts batches of mel spectrograms.
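A minimal sketch of what such a batch looks like, assuming the 80-band mel configuration typically used by NeMo's English FastPitch/HiFi-GAN pipeline (the band count is an assumption, not stated above):

```python
import numpy as np

# Hypothetical batch of mel spectrograms with shape (batch, n_mels, frames).
# n_mels = 80 is assumed from the typical NeMo TTS configuration.
batch_size, n_mels, n_frames = 4, 80, 200
mel_batch = np.random.randn(batch_size, n_mels, n_frames).astype(np.float32)
print(mel_batch.shape)  # (4, 80, 200)
```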
Output
This model outputs audio at 22050 Hz.
🔧 Technical Details
Model Architecture
HiFi-GAN consists of one generator and two discriminators: a multi-scale discriminator and a multi-period discriminator. The generator and discriminators are trained adversarially, along with two additional losses (a mel-spectrogram loss and a feature-matching loss) that improve training stability and model performance.
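To make the multi-period idea concrete, here is a sketch (using NumPy, not the actual model code) of how a period-p discriminator views a 1-D waveform: the signal is padded to a multiple of p and reshaped into a 2-D grid, so samples that are p steps apart line up in columns for 2-D convolutions.

```python
import numpy as np

def periodize(audio: np.ndarray, period: int) -> np.ndarray:
    """Reshape a 1-D waveform into (frames, period) for a period-p discriminator.

    HiFi-GAN's multi-period discriminator applies 2-D convolutions over this
    view; this sketch reproduces only the reshaping, not the discriminator.
    """
    pad = (-len(audio)) % period           # pad so length is a multiple of period
    padded = np.pad(audio, (0, pad), mode="reflect")
    return padded.reshape(-1, period)

audio = np.arange(10, dtype=np.float32)    # toy 10-sample "waveform"
grid = periodize(audio, 3)
print(grid.shape)  # (4, 3): 10 samples padded to 12, viewed with period 3
```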
Training
The model was trained for several epochs using the NeMo toolkit, with this example script and this base config.
Datasets
This model is trained on LJSpeech sampled at 22050 Hz, and has been tested on generating female English voices with an American accent.
🚀 Deployment with NVIDIA Riva
For the best real-time accuracy, latency, and throughput, deploy the model with NVIDIA Riva, an accelerated speech AI SDK deployable on-prem, in all clouds, multi-cloud, hybrid, at the edge, and embedded.
Additionally, Riva provides:
- World-class out-of-the-box accuracy for the most common languages, with model checkpoints trained on proprietary data with hundreds of thousands of GPU-compute hours
- Best-in-class accuracy with run-time word boosting (e.g., brand and product names) and customization of acoustic model, language model, and inverse text normalization
- Streaming speech recognition, Kubernetes-compatible scaling, and enterprise-grade support
Check out the Riva live demo.
📄 License
This model is licensed under CC-BY-4.0.
📖 References