🚀 FastSpeech2Conformer
FastSpeech2Conformer is a non-autoregressive text-to-speech (TTS) model. It combines the advantages of FastSpeech2 and the Conformer architecture, enabling it to generate high-quality speech from text rapidly and efficiently.
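To see why non-autoregressive generation is faster, consider a toy NumPy sketch (illustrative only, not the actual model): an autoregressive decoder must produce mel frames one at a time because each frame depends on the previous one, while a non-autoregressive decoder emits every frame in a single parallel pass.

```python
import numpy as np

rng = np.random.default_rng(0)
num_frames, mel_bins = 200, 80
hidden = rng.standard_normal((num_frames, mel_bins))
proj = rng.standard_normal((mel_bins, mel_bins)) * 0.01

# Autoregressive decoding: frame t depends on frame t-1,
# so the loop must run sequentially, one step per frame.
mel_ar = np.zeros((num_frames, mel_bins))
prev = np.zeros(mel_bins)
for t in range(num_frames):
    prev = hidden[t] + prev @ proj
    mel_ar[t] = prev

# Non-autoregressive decoding (FastSpeech2 style): every frame is
# produced in one parallel pass, with no frame-to-frame dependency.
mel_nar = hidden + hidden @ proj

print(mel_ar.shape, mel_nar.shape)  # both (200, 80)
```

The sequential loop is the bottleneck the non-autoregressive design removes; in the real model the parallel pass is a Conformer decoder rather than a single matrix product.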
✨ Features
- FastSpeech2Conformer is a non-autoregressive TTS model, which can generate speech much faster than autoregressive models.
- It directly trains the model with ground-truth targets, addressing some limitations of its predecessor, FastSpeech.
- It introduces more speech variation information (e.g., pitch, energy, and more accurate duration) as conditional inputs.
- The Conformer (convolutional transformer) architecture captures local speech patterns using convolutions within transformer blocks, while the attention layer captures long-range relationships in the input.
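The duration information in particular drives FastSpeech2's length regulator, which upsamples each phoneme-level hidden state to its predicted number of mel frames. A minimal NumPy sketch of that idea (illustrative only, not the library's internal implementation):

```python
import numpy as np

def length_regulate(phoneme_states, durations):
    """Repeat each phoneme-level vector durations[i] times,
    expanding the phoneme sequence to a frame-level sequence."""
    return np.repeat(phoneme_states, durations, axis=0)

# 3 phonemes with a 2-dim hidden state each (toy values)
states = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
durations = np.array([2, 1, 3])  # predicted frames per phoneme

frames = length_regulate(states, durations)
print(frames.shape)  # (6, 2): the durations sum to 6 frames
```

Because the total frame count is known up front from the predicted durations, the decoder can generate all frames in parallel.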
📦 Installation
You can run FastSpeech2Conformer locally with the 🤗 Transformers library. First, install the 🤗 Transformers library and g2p-en:

```bash
pip install --upgrade pip
pip install --upgrade transformers g2p-en
```
💻 Usage Examples
Basic Usage
Run inference via the Transformers modelling code with the model and HiFi-GAN vocoder loaded separately:
```python
from transformers import FastSpeech2ConformerTokenizer, FastSpeech2ConformerModel, FastSpeech2ConformerHifiGan
import soundfile as sf

tokenizer = FastSpeech2ConformerTokenizer.from_pretrained("espnet/fastspeech2_conformer")
inputs = tokenizer("Hello, my dog is cute.", return_tensors="pt")
input_ids = inputs["input_ids"]

model = FastSpeech2ConformerModel.from_pretrained("espnet/fastspeech2_conformer")
output_dict = model(input_ids, return_dict=True)
spectrogram = output_dict["spectrogram"]

hifigan = FastSpeech2ConformerHifiGan.from_pretrained("espnet/fastspeech2_conformer_hifigan")
waveform = hifigan(spectrogram)

sf.write("speech.wav", waveform.squeeze().detach().numpy(), samplerate=22050)
```
Advanced Usage
- Run inference via the Transformers modelling code with the model and HiFi-GAN vocoder combined into one module:
```python
from transformers import FastSpeech2ConformerTokenizer, FastSpeech2ConformerWithHifiGan
import soundfile as sf

tokenizer = FastSpeech2ConformerTokenizer.from_pretrained("espnet/fastspeech2_conformer")
inputs = tokenizer("Hello, my dog is cute.", return_tensors="pt")
input_ids = inputs["input_ids"]

model = FastSpeech2ConformerWithHifiGan.from_pretrained("espnet/fastspeech2_conformer_with_hifigan")
output_dict = model(input_ids, return_dict=True)
waveform = output_dict["waveform"]

sf.write("speech.wav", waveform.squeeze().detach().numpy(), samplerate=22050)
```
- Run inference with a pipeline and specify which vocoder to use:
```python
from transformers import pipeline, FastSpeech2ConformerHifiGan
import soundfile as sf

vocoder = FastSpeech2ConformerHifiGan.from_pretrained("espnet/fastspeech2_conformer_hifigan")
synthesiser = pipeline(model="espnet/fastspeech2_conformer", vocoder=vocoder)

speech = synthesiser("Hello, my dog is cooler than you!")
sf.write("speech.wav", speech["audio"].squeeze(), samplerate=speech["sampling_rate"])
```
📚 Documentation
Model Description
The FastSpeech2Conformer model was proposed in the paper Recent Developments on ESPnet Toolkit Boosted by Conformer by Pengcheng Guo, Florian Boyer, Xuankai Chang, Tomoki Hayashi, Yosuke Higuchi, Hirofumi Inaguma, Naoyuki Kamo, Chenda Li, Daniel Garcia-Romero, Jiatong Shi, Jing Shi, Shinji Watanabe, Kun Wei, Wangyou Zhang, and Yuekai Zhang. It was first released in this repository. The license used is Apache 2.0.
Recommendations
Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information is needed for further recommendations.
Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
📄 License
The model uses the Apache 2.0 license.
Model Card Authors
Connor Henderson (Disclaimer: no ESPnet affiliation)