🚀 FastSpeech2Conformer
FastSpeech2Conformer is a non-autoregressive text-to-speech (TTS) model. It combines the advantages of FastSpeech2 and the Conformer architecture, enabling it to generate high-quality speech from text rapidly and efficiently.
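To see why non-autoregressive generation is faster, consider a toy NumPy sketch (illustrative only, not the actual model): an autoregressive decoder must produce mel frames one at a time because each frame depends on the previous one, while a non-autoregressive decoder emits every frame in a single parallel pass.

```python
import numpy as np

rng = np.random.default_rng(0)
num_frames, mel_bins = 200, 80
hidden = rng.standard_normal((num_frames, mel_bins))
proj = rng.standard_normal((mel_bins, mel_bins)) * 0.01

# Autoregressive decoding: frame t depends on frame t-1,
# so the loop must run sequentially, one step per frame.
mel_ar = np.zeros((num_frames, mel_bins))
prev = np.zeros(mel_bins)
for t in range(num_frames):
    prev = hidden[t] + prev @ proj
    mel_ar[t] = prev

# Non-autoregressive decoding (FastSpeech2 style): every frame is
# produced in one parallel pass, with no frame-to-frame dependency.
mel_nar = hidden + hidden @ proj

print(mel_ar.shape, mel_nar.shape)  # both (200, 80)
```

The sequential loop is the bottleneck the non-autoregressive design removes; in the real model the parallel pass is a Conformer decoder rather than a single matrix product.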
✨ Features
- FastSpeech2Conformer is a non-autoregressive TTS model, which can generate speech much faster than autoregressive models.
- It directly trains the model with ground-truth targets, addressing some limitations of its predecessor, FastSpeech.
- It introduces more speech variation information (e.g., pitch, energy, and more accurate duration) as conditional inputs.
- The Conformer (convolutional transformer) architecture captures local speech patterns using convolutions within transformer blocks, while the attention layer captures long-range relationships in the input.
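The duration information in particular drives FastSpeech2's length regulator, which upsamples each phoneme-level hidden state to its predicted number of mel frames. A minimal NumPy sketch of that idea (illustrative only, not the library's internal implementation):

```python
import numpy as np

def length_regulate(phoneme_states, durations):
    """Repeat each phoneme-level vector durations[i] times,
    expanding the phoneme sequence to a frame-level sequence."""
    return np.repeat(phoneme_states, durations, axis=0)

# 3 phonemes with a 2-dim hidden state each (toy values)
states = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
durations = np.array([2, 1, 3])  # predicted frames per phoneme

frames = length_regulate(states, durations)
print(frames.shape)  # (6, 2): the durations sum to 6 frames
```

Because the total frame count is known up front from the predicted durations, the decoder can generate all frames in parallel.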
📦 Installation
You can run FastSpeech2Conformer locally with the 🤗 Transformers library. First, install the 🤗 Transformers library and g2p-en:

```bash
pip install --upgrade pip
pip install --upgrade transformers g2p-en
```
💻 Usage Examples
Basic Usage
Run inference via the Transformers modelling code with the model and HiFi-GAN vocoder loaded separately:
```python
from transformers import FastSpeech2ConformerTokenizer, FastSpeech2ConformerModel, FastSpeech2ConformerHifiGan
import soundfile as sf

tokenizer = FastSpeech2ConformerTokenizer.from_pretrained("espnet/fastspeech2_conformer")
inputs = tokenizer("Hello, my dog is cute.", return_tensors="pt")
input_ids = inputs["input_ids"]

model = FastSpeech2ConformerModel.from_pretrained("espnet/fastspeech2_conformer")
output_dict = model(input_ids, return_dict=True)
spectrogram = output_dict["spectrogram"]

hifigan = FastSpeech2ConformerHifiGan.from_pretrained("espnet/fastspeech2_conformer_hifigan")
waveform = hifigan(spectrogram)

sf.write("speech.wav", waveform.squeeze().detach().numpy(), samplerate=22050)
```
Advanced Usage
- Run inference via the Transformers modelling code with the model and HiFi-GAN vocoder combined into one module:
```python
from transformers import FastSpeech2ConformerTokenizer, FastSpeech2ConformerWithHifiGan
import soundfile as sf

tokenizer = FastSpeech2ConformerTokenizer.from_pretrained("espnet/fastspeech2_conformer")
inputs = tokenizer("Hello, my dog is cute.", return_tensors="pt")
input_ids = inputs["input_ids"]

model = FastSpeech2ConformerWithHifiGan.from_pretrained("espnet/fastspeech2_conformer_with_hifigan")
output_dict = model(input_ids, return_dict=True)
waveform = output_dict["waveform"]

sf.write("speech.wav", waveform.squeeze().detach().numpy(), samplerate=22050)
```
- Run inference with a pipeline and specify which vocoder to use:
```python
from transformers import pipeline, FastSpeech2ConformerHifiGan
import soundfile as sf

vocoder = FastSpeech2ConformerHifiGan.from_pretrained("espnet/fastspeech2_conformer_hifigan")
synthesiser = pipeline(model="espnet/fastspeech2_conformer", vocoder=vocoder)

speech = synthesiser("Hello, my dog is cooler than you!")
sf.write("speech.wav", speech["audio"].squeeze(), samplerate=speech["sampling_rate"])
```
📚 Documentation
Model Description
The FastSpeech2Conformer model was proposed in the paper Recent Developments on ESPnet Toolkit Boosted by Conformer by Pengcheng Guo, Florian Boyer, Xuankai Chang, Tomoki Hayashi, Yosuke Higuchi, Hirofumi Inaguma, Naoyuki Kamo, Chenda Li, Daniel Garcia-Romero, Jiatong Shi, Jing Shi, Shinji Watanabe, Kun Wei, Wangyou Zhang, and Yuekai Zhang. It was first released in this repository. The license used is Apache 2.0.
Recommendations
Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information is needed for further recommendations.
Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
📄 License
The model uses the Apache 2.0 license.
Model Card Authors
Connor Henderson (Disclaimer: no ESPnet affiliation)