Vocos Open-Source Fast Neural Vocoder - High-Efficiency Audio Reconstruction Empowers Text-to-Speech Tasks

Vocos Mel Hifigan Compat 44100khz

Developed by patriotyk

Vocos is a fast neural vocoder that achieves efficient audio reconstruction by generating spectral coefficients, particularly suitable for text-to-speech tasks.

Speech Synthesis

TensorBoard

Open Source License: MIT

#Fast Spectral Reconstruction #Mel-Spectrogram Compatible #High-Fidelity Speech Synthesis

Downloads 2,222

Release Time : 5/10/2024

Model Overview

Vocos is a fast neural vocoder designed to synthesize audio waveforms from acoustic features. It generates spectral coefficients and reconstructs the waveform with an inverse Fourier transform, which makes it faster than GAN vocoders such as HiFi-GAN that model audio samples directly in the time domain.

Model Features

Fast Spectral Reconstruction

Achieves faster audio reconstruction by generating spectral coefficients instead of directly modeling time-domain audio samples

High-Fidelity Audio Synthesis

Uses mel-spectrograms as acoustic features to generate high-quality audio waveforms

Compatibility with Multiple TTS Models

Designed to be compatible with acoustic outputs from various text-to-speech models

Efficient Training

This checkpoint was trained in about one month of continuous training on two RTX 3090 GPUs

Model Capabilities

Mel-Spectrogram to Audio Conversion

High-Fidelity Speech Synthesis

Fast Audio Reconstruction

Use Cases

Speech Synthesis

Text-to-Speech System

Serves as the backend vocoder for TTS systems, converting mel-spectrograms into natural speech

Generates high-quality speech output

Audio Processing

Speech Enhancement

Transforms and reconstructs speech features

May improve speech quality

🚀 Vocos: Fast Neural Vocoder

Vocos is a fast neural vocoder that synthesizes audio waveforms from acoustic features. It offers a faster alternative to HiFi-GAN and is compatible with the acoustic output of multiple TTS models.

🚀 Quick Start

The model is mainly used as a vocoder to synthesize audio waveforms from mel spectrograms. However, it was trained for speech generation and may not produce high-quality samples in other audio domains.

Installation

To use Vocos only in inference mode, install it using:

pip install git+https://github.com/langtech-bsc/vocos.git@matcha

Reconstruct audio from a mel-spectrogram

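import torch
from vocos import Vocos

# Load the pretrained checkpoint by its model id.
vocos = Vocos.from_pretrained("patriotyk/vocos-mel-hifigan-compat-44100khz")

# Random placeholder input; replace it with a real 80-bin mel spectrogram.
mel = torch.randn(1, 80, 256)  # B, C, T
audio = vocos.decode(mel)  # synthesized waveform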
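
The input layout in the example above is (batch, mel bins, frames). A common mistake is passing a (batch, frames, mel bins) tensor instead. The helper below is a minimal sketch of a shape check that could run before decoding; `check_mel_shape` is a hypothetical name and is not part of the Vocos package.

```python
def check_mel_shape(shape, n_mels=80):
    """Validate a (batch, channels, frames) mel shape and return the frame count.

    Hypothetical helper for illustration; it is not part of the vocos package.
    """
    if len(shape) != 3:
        raise ValueError(f"expected 3 dimensions (B, C, T), got {len(shape)}")
    batch, channels, frames = shape
    if channels != n_mels:
        raise ValueError(
            f"expected {n_mels} mel bins on axis 1, got {channels}; "
            "a (B, T, C) tensor needs its last two axes swapped first"
        )
    return frames


# The shape used in the example above passes the check.
frames = check_mel_shape((1, 80, 256))
print(frames)
```

With a PyTorch tensor, the same check can be called as `check_mel_shape(tuple(mel.shape))`.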

✨ Features

Fast Synthesis: Vocos synthesizes audio quickly by generating spectral coefficients and applying an inverse Fourier transform.
Compatibility: It is compatible with the acoustic output of several TTS models and uses 80-bin mel spectrograms, which are common in the TTS domain.


📚 Documentation

Model Description

Vocos is a fast neural vocoder designed to synthesize audio waveforms from acoustic features. Unlike typical GAN-based vocoders, Vocos does not model audio samples in the time domain. Instead, it generates spectral coefficients, which allows rapid audio reconstruction through an inverse Fourier transform.
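
To make the idea of reconstructing audio from spectral coefficients concrete, here is a toy sketch in pure Python. It assumes the coefficients are expressed as a magnitude and a phase per frequency bin and rebuilds a single frame with a naive inverse DFT. It is an illustration of the principle only, not the Vocos implementation; the function names are made up for this example.

```python
import cmath
import math


def dft(frame):
    """Naive forward DFT of a real frame (used here only to create test data)."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def frame_from_coefficients(magnitudes, phases):
    """Rebuild one frame of samples from per-bin magnitude and phase."""
    n = len(magnitudes)
    # Combine magnitude and phase into complex spectral coefficients.
    spectrum = [cmath.rect(m, p) for m, p in zip(magnitudes, phases)]
    # Naive inverse DFT; the imaginary part vanishes for a real signal.
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


# A 16-sample sine frame stands in for one frame of audio.
frame = [math.sin(2 * math.pi * 3 * t / 16) for t in range(16)]
spectrum = dft(frame)
magnitudes = [abs(c) for c in spectrum]
phases = [cmath.phase(c) for c in spectrum]

rebuilt = frame_from_coefficients(magnitudes, phases)
max_error = max(abs(a - b) for a, b in zip(frame, rebuilt))
print(f"max reconstruction error: {max_error:.2e}")
```

Each set of coefficients yields a whole frame of samples at once, which is where the speed advantage over sample-level time-domain generation comes from. A full inverse short-time Fourier transform would additionally window and overlap-add consecutive frames.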

This version of Vocos uses 80-bin mel spectrograms as acoustic features, which have been widespread in the TTS domain since the introduction of HiFi-GAN. The goal of this model is to provide an alternative to HiFi-GAN that is faster and compatible with the acoustic output of several TTS models.

Intended Uses and Limitations

The model is intended to serve as a vocoder that synthesizes audio waveforms from mel spectrograms. It is trained to generate speech; if used in other audio domains, it may not produce high-quality samples.

Training Data

The model was trained on a private dataset of more than 800 hours, built from Ukrainian audiobooks using the narizaka tool.

Training Procedure

The model was trained for 2.0M steps (210 epochs) with a batch size of 20. A cosine scheduler with an initial learning rate of 3e-4 was used. Training ran on two RTX 3090 GPUs and took about one month of continuous training.
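
The figures above can be turned into a rough picture of the training workload. The sketch below is back-of-the-envelope arithmetic only. It assumes a 44,100 Hz sample rate (inferred from the model name), uses the num_samples value of 32768 from the hyperparameters listed for this model, and treats the batch size of 20 as the total per step; if the batch size is per GPU, the per-step figures double.

```python
SAMPLE_RATE_HZ = 44_100   # assumption: inferred from the model name
NUM_SAMPLES = 32_768      # samples per training segment (from the hyperparameters)
BATCH_SIZE = 20           # assumption: total per step, not per GPU
TOTAL_STEPS = 2_000_000
EPOCHS = 210

# Duration of one training segment and of the audio seen in one step.
segment_seconds = NUM_SAMPLES / SAMPLE_RATE_HZ
audio_seconds_per_step = BATCH_SIZE * segment_seconds

# Average number of optimizer steps and segments per epoch.
steps_per_epoch = TOTAL_STEPS / EPOCHS
segments_per_epoch = steps_per_epoch * BATCH_SIZE

print(f"segment length:     {segment_seconds:.3f} s")
print(f"audio per step:     {audio_seconds_per_step:.2f} s")
print(f"steps per epoch:    {steps_per_epoch:.0f}")
print(f"segments per epoch: {segments_per_epoch:.0f}")
```

Under these assumptions each segment is roughly three quarters of a second long, so the model learns from short excerpts rather than whole utterances.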

Training Hyperparameters

initial_learning_rate: 3e-4
scheduler: cosine without warmup or restarts
mel_loss_coeff: 45
mrd_loss_coeff: 1.0
batch_size: 20
num_samples: 32768
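
As an illustration of the "cosine without warmup or restarts" setting, the sketch below computes the learning rate at a given step. It assumes the schedule spans the full 2.0M training steps and decays to zero; the minimum learning rate is not stated in the source, so `min_lr=0.0` is an assumption, and `cosine_lr` is a hypothetical helper rather than the training code used for this model.

```python
import math


def cosine_lr(step, total_steps=2_000_000, initial_lr=3e-4, min_lr=0.0):
    """Cosine-annealed learning rate with no warmup and no restarts.

    min_lr=0.0 and the 2.0M-step horizon are assumptions for illustration.
    """
    # Clamp progress to [0, 1] so steps outside the range stay well defined.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (initial_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 500_000, 1_000_000, 1_500_000, 2_000_000):
    print(f"step {step:>9,}: lr = {cosine_lr(step):.2e}")
```

The rate starts at the initial value, reaches half of it at the midpoint, and falls smoothly toward the minimum at the final step.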

Evaluation

Evaluation used the metrics from the original Vocos repository. After 210 epochs, the model achieved:

val_loss: 3.703
f1_score: 0.950
mel_loss: 0.248
periodicity_loss: 0.127
pesq_score: 3.399
pitch_loss: 38.26
utmos_score: 3.146


📄 License

This project is licensed under the MIT license.
