Open-Source Audio Generation Model BigVGAN - Generate High-Quality Audio Waveforms from Mel Spectrogram for Free

BigVGAN v2 22 kHz 80-band 256x

Developed by NVIDIA

BigVGAN is a general-purpose neural vocoder trained at scale, capable of generating high-quality audio waveforms from mel spectrograms.

Category: Speech Synthesis | Open Source | License: MIT | Tags: High-fidelity audio synthesis, Multi-scale discriminator, CUDA-accelerated inference

Downloads 503.23k

Release Time : 7/15/2024

Model Overview

BigVGAN is a neural vocoder trained at scale on speech, environmental sounds, and musical instruments, so a single model covers many audio types. BigVGAN-v2 adds a custom CUDA kernel that makes inference 1.5 to 3x faster on a single A100 GPU.

Model Features

High-performance inference

Custom CUDA kernels make inference 1.5-3x faster (measured on a single A100 GPU)

Large-scale training

Trained on diverse audio datasets to support multiple audio types

High-quality audio generation

Achieves state-of-the-art results on benchmarks like LibriTTS

Multi-configuration support

Provides pretrained models at 22 kHz, 24 kHz, and 44 kHz sampling rates with 256x or 512x upsampling ratios

Model Capabilities

Generate high-quality audio from mel spectrograms

Support audio generation at various sampling rates

Fast inference (using CUDA kernels)

Use Cases

Speech synthesis

TTS system backend

Serves as the vocoder component for text-to-speech systems

Generates natural and fluent speech

Audio enhancement

Audio super-resolution

Enhances sampling rate and clarity of low-quality audio

🚀 BigVGAN: A Universal Neural Vocoder with Large-Scale Training

BigVGAN is a universal neural vocoder that enables high-quality audio generation through large-scale training.

Sang-gil Lee, Wei Ping, Boris Ginsburg, Bryan Catanzaro, Sungroh Yoon

🚀 Quick Start

This repository provides pretrained BigVGAN checkpoints for easy inference, with huggingface_hub support for loading them. For model training and other features, see the official GitHub repository: https://github.com/NVIDIA/BigVGAN.

✨ Features

News

Jul 2024 (v2.3):
- General refactoring and code improvements for better readability.
- A fully fused CUDA kernel for anti-aliased activation (upsampling + activation + downsampling) with an inference speed benchmark.
Jul 2024 (v2.2): The repository now includes an interactive local demo using Gradio.
Jul 2024 (v2.1): BigVGAN is now integrated with 🤗 Hugging Face Hub, enabling easy inference with pretrained checkpoints. An interactive demo is also available on Hugging Face Spaces.
Jul 2024 (v2): We released BigVGAN-v2 along with pretrained checkpoints. Highlights include:
- A custom CUDA kernel for inference: A fused upsampling + activation kernel written in CUDA for accelerated inference speed. Tests show 1.5 - 3x faster speed on a single A100 GPU.
- An improved discriminator and loss: BigVGAN-v2 is trained using a multi-scale sub-band CQT discriminator and a multi-scale mel spectrogram loss.
- Larger training data: BigVGAN-v2 is trained on datasets with diverse audio types, including multi-language speech, environmental sounds, and instruments.
- We offer pretrained checkpoints of BigVGAN-v2 with diverse audio configurations, supporting up to 44 kHz sampling rate and 512x upsampling ratio.
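The anti-aliased activation mentioned in the v2.3 and v2 notes (upsampling + activation + downsampling) can be illustrated with a small NumPy sketch. This is not the repository's CUDA kernel. The Hann-windowed sinc filter, the 13-tap length, the 2x factor, and the Snake-style activation x + sin²(αx)/α are all assumptions chosen to keep the example short, and every function name here is hypothetical.

```python
import numpy as np


def lowpass_kernel(num_taps: int = 13, cutoff: float = 0.25) -> np.ndarray:
    """Hann-windowed sinc low-pass filter (cutoff in cycles per sample)."""
    n = np.arange(num_taps) - (num_taps - 1) / 2.0
    kernel = 2.0 * cutoff * np.sinc(2.0 * cutoff * n) * np.hanning(num_taps)
    return kernel / kernel.sum()  # unit gain at DC


def snake(x: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Snake-style periodic activation (an assumption of this sketch)."""
    return x + np.sin(alpha * x) ** 2 / alpha


def anti_aliased_activation(x: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Upsample 2x, apply the activation, then low-pass and downsample 2x.

    Assumes a 1-D signal that is longer than the filter.
    """
    kernel = lowpass_kernel()
    # 1) Upsample: insert zeros, then low-pass. The factor 2 restores the
    #    amplitude lost by zero insertion.
    up = np.zeros(2 * len(x))
    up[::2] = x
    up = 2.0 * np.convolve(up, kernel, mode="same")
    # 2) Apply the nonlinearity at the higher sampling rate.
    activated = snake(up, alpha)
    # 3) Low-pass again, then keep every other sample.
    return np.convolve(activated, kernel, mode="same")[::2]


signal = np.sin(2 * np.pi * 0.05 * np.arange(256))
out = anti_aliased_activation(signal)
print(out.shape)  # same length as the input
```

A nonlinearity creates new high-frequency content. Applying it at twice the sampling rate and low-pass filtering before decimation limits how much of that content folds back as aliasing. A fused kernel performs the three steps in one GPU pass; the sketch above keeps them separate for readability.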

📦 Installation

git lfs install
git clone https://huggingface.co/nvidia/bigvgan_v2_22khz_80band_256x

💻 Usage Examples

Basic Usage

The following example demonstrates how to use BigVGAN: load the pretrained BigVGAN generator from Hugging Face Hub, compute the mel spectrogram from the input waveform, and generate a synthesized waveform using the mel spectrogram as the model input.

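import torch
import librosa

# bigvgan and meldataset are modules provided by the cloned repository (see Installation),
# so run this script from inside that directory.
import bigvgan
from meldataset import get_mel_spectrogram

device = 'cuda'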

# instantiate the model. You can optionally set use_cuda_kernel=True for faster inference.
model = bigvgan.BigVGAN.from_pretrained('nvidia/bigvgan_v2_22khz_80band_256x', use_cuda_kernel=False)

# remove weight norm in the model and set to eval mode
model.remove_weight_norm()
model = model.eval().to(device)

# load wav file and compute mel spectrogram
wav_path = '/path/to/your/audio.wav'
wav, sr = librosa.load(wav_path, sr=model.h.sampling_rate, mono=True) # wav is np.ndarray with shape [T_time] and values in [-1, 1]
wav = torch.FloatTensor(wav).unsqueeze(0) # wav is FloatTensor with shape [B(1), T_time]

# compute mel spectrogram from the ground truth audio
mel = get_mel_spectrogram(wav, model.h).to(device) # mel is FloatTensor with shape [B(1), C_mel, T_frame]

# generate waveform from mel
with torch.inference_mode():
    wav_gen = model(mel) # wav_gen is FloatTensor with shape [B(1), 1, T_time] and values in [-1, 1]
wav_gen_float = wav_gen.squeeze(0).cpu() # wav_gen_float is FloatTensor with shape [1, T_time]

# you can convert the generated waveform to 16 bit linear PCM
wav_gen_int16 = (wav_gen_float * 32767.0).numpy().astype('int16') # wav_gen_int16 is np.ndarray with shape [1, T_time] and int16 dtype
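The 16-bit conversion in the example stops short of writing a file. A minimal sketch of that last step follows, using only NumPy and the standard-library wave module. The helper names float_to_int16 and write_wav are hypothetical, the sine tone stands in for the generated waveform, and 22,050 Hz is assumed as the sampling rate of this checkpoint. Clipping before the cast is an added safeguard, since samples outside [-1, 1] would not fit in int16 after scaling.

```python
import os
import tempfile
import wave

import numpy as np


def float_to_int16(wav_float: np.ndarray) -> np.ndarray:
    """Clip to [-1, 1], then scale to 16-bit linear PCM."""
    clipped = np.clip(wav_float, -1.0, 1.0)
    return (clipped * 32767.0).astype(np.int16)


def write_wav(path: str, pcm: np.ndarray, sampling_rate: int) -> None:
    """Write mono int16 samples (shape [T_time] or [1, T_time]) to a WAV file."""
    samples = np.ascontiguousarray(pcm.reshape(-1), dtype="<i2")
    with wave.open(path, "wb") as f:
        f.setnchannels(1)              # mono
        f.setsampwidth(2)              # 2 bytes per sample = 16 bit
        f.setframerate(sampling_rate)
        f.writeframes(samples.tobytes())


# Stand-in for wav_gen_float.numpy(): one second of a 440 Hz tone, shape [1, T_time].
sr = 22050  # assumed sampling rate; in the example above it comes from model.h.sampling_rate
t = np.arange(sr) / sr
demo = 0.5 * np.sin(2 * np.pi * 440.0 * t)[np.newaxis, :]

out_path = os.path.join(tempfile.gettempdir(), "bigvgan_demo.wav")
write_wav(out_path, float_to_int16(demo), sr)
```

With the real model output, the call would be write_wav(path, float_to_int16(wav_gen_float.numpy()), model.h.sampling_rate).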

Advanced Usage

You can apply the fast CUDA inference kernel by using the parameter use_cuda_kernel when instantiating BigVGAN:

import bigvgan
model = bigvgan.BigVGAN.from_pretrained('nvidia/bigvgan_v2_22khz_80band_256x', use_cuda_kernel=True)

The first time the model is instantiated with use_cuda_kernel=True, it builds the kernel using nvcc and ninja. If the build succeeds, the kernel is saved to alias_free_activation/cuda/build and the model loads it automatically. The codebase has been tested with CUDA 12.1.

Make sure that both nvcc and ninja are installed on your system and that your nvcc version matches the CUDA version used by your PyTorch build.

For more details, see the official GitHub repository: https://github.com/NVIDIA/BigVGAN?tab=readme-ov-file#using-custom-cuda-kernel-for-synthesis
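Before enabling use_cuda_kernel, the build prerequisites can be checked from Python. The sketch below is one possible way to do it and is not part of the BigVGAN codebase: the helper names are hypothetical, and it assumes that nvcc --version prints a line containing "release X.Y" and that torch.version.cuda reports the CUDA version of the PyTorch build.

```python
import re
import shutil
import subprocess
from typing import Dict, Optional


def parse_nvcc_release(version_output: str) -> Optional[str]:
    """Extract 'X.Y' from the 'release X.Y' part of `nvcc --version` output."""
    match = re.search(r"release\s+(\d+\.\d+)", version_output)
    return match.group(1) if match else None


def find_build_tools() -> Dict[str, Optional[str]]:
    """Return the path of each required build tool, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in ("nvcc", "ninja")}


def installed_nvcc_release() -> Optional[str]:
    """Ask nvcc for its version; returns None when nvcc is not installed."""
    if shutil.which("nvcc") is None:
        return None
    result = subprocess.run(
        ["nvcc", "--version"], capture_output=True, text=True, check=False
    )
    return parse_nvcc_release(result.stdout)


def versions_match(nvcc_release: Optional[str], torch_cuda: Optional[str]) -> bool:
    """True only when both versions are known and identical."""
    return nvcc_release is not None and nvcc_release == torch_cuda


if __name__ == "__main__":
    try:
        import torch
        torch_cuda = torch.version.cuda  # None for CPU-only builds
    except ImportError:
        torch_cuda = None

    print("build tools:", find_build_tools())
    nvcc_release = installed_nvcc_release()
    print("nvcc release:", nvcc_release, "| PyTorch CUDA:", torch_cuda)
    print("versions match:", versions_match(nvcc_release, torch_cuda))
```

If either tool is missing or the versions differ, keeping use_cuda_kernel=False avoids the build step entirely.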

📚 Documentation

Pretrained Models

We offer pretrained models on Hugging Face Collections. You can download the generator weight checkpoints (named bigvgan_generator.pt) and their discriminator/optimizer states (named bigvgan_discriminator_optimizer.pt) from the listed model repositories.

Model Name	Sampling Rate	Mel Bands	fmax (Hz)	Upsampling Ratio	Params	Dataset	Steps	Fine-Tuned
bigvgan_v2_44khz_128band_512x	44 kHz	128	22050	512	122M	Large-scale Compilation	5M	No
bigvgan_v2_44khz_128band_256x	44 kHz	128	22050	256	112M	Large-scale Compilation	5M	No
bigvgan_v2_24khz_100band_256x	24 kHz	100	12000	256	112M	Large-scale Compilation	5M	No
bigvgan_v2_22khz_80band_256x	22 kHz	80	11025	256	112M	Large-scale Compilation	5M	No
bigvgan_v2_22khz_80band_fmax8k_256x	22 kHz	80	8000	256	112M	Large-scale Compilation	5M	No
bigvgan_24khz_100band	24 kHz	100	12000	256	112M	LibriTTS	5M	No
bigvgan_base_24khz_100band	24 kHz	100	12000	256	14M	LibriTTS	5M	No
bigvgan_22khz_80band	22 kHz	80	8000	256	112M	LibriTTS + VCTK + LJSpeech	5M	No
bigvgan_base_22khz_80band	22 kHz	80	8000	256	14M	LibriTTS + VCTK + LJSpeech	5M	No
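The table columns relate to each other through simple arithmetic, sketched below. Two assumptions are made here: the upsampling ratio is read as the number of waveform samples generated per mel frame, and "22 kHz" and "44 kHz" are read as 22,050 Hz and 44,100 Hz, which is what the fmax values 11025 and 22050 (half the sampling rate) suggest. The function names are hypothetical.

```python
def num_output_samples(num_mel_frames: int, upsampling_ratio: int) -> int:
    """Waveform length produced from a mel spectrogram with T_frame frames."""
    return num_mel_frames * upsampling_ratio


def mel_frames_per_second(sampling_rate_hz: int, upsampling_ratio: int) -> float:
    """How many mel frames cover one second of audio."""
    return sampling_rate_hz / upsampling_ratio


def nyquist_hz(sampling_rate_hz: int) -> float:
    """Highest frequency representable at a given sampling rate."""
    return sampling_rate_hz / 2


# bigvgan_v2_22khz_80band_256x: 100 mel frames become 25,600 samples.
print(num_output_samples(100, 256))        # 25600
print(mel_frames_per_second(22050, 256))   # 86.1328125
print(nyquist_hz(22050))                   # 11025.0

# bigvgan_v2_44khz_128band_512x has the same frame rate at twice the sampling rate.
print(mel_frames_per_second(44100, 512))   # 86.1328125
```

Checkpoints whose fmax is below the Nyquist frequency, such as the fmax8k variant, expect mel spectrograms computed with that lower cutoff.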

📄 License

This project is licensed under the MIT License.
