🚀 NVIDIA FastPitch (en-US)
FastPitch is a fully-parallel transformer architecture for text-to-speech (TTS). It offers prosody control over pitch and individual phoneme duration and uses an unsupervised speech-text aligner. This model is also compatible with NVIDIA Riva for production-grade server deployments.
🚀 Quick Start
The model is available for use in the NeMo toolkit and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset. To train, fine-tune, or play with the model, you need to install NVIDIA NeMo. We recommend installing it after you have installed the latest PyTorch version.

```bash
pip install nemo_toolkit['all']
```
✨ Features
- Fully-parallel transformer architecture with prosody control over pitch and individual phoneme duration.
- Uses an unsupervised speech-text aligner.
- Compatible with NVIDIA Riva for production-grade server deployments.
📦 Installation
To use this model, you need to install NVIDIA NeMo. We suggest installing it after you have installed the latest PyTorch version. You can install it using the following command:

```bash
pip install nemo_toolkit['all']
```
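After installation, a quick import check (a minimal sketch, not part of the official instructions) confirms that the toolkit and its TTS collection are available:

```python
# Sanity check: verify that NeMo and its TTS collection import cleanly
import nemo
from nemo.collections.tts.models import FastPitchModel

print(nemo.__version__)
```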
💻 Usage Examples
Basic Usage
Note: This model generates only spectrograms; a vocoder is needed to convert the spectrograms to waveforms. In this example, HiFi-GAN is used.

```python
# Load the FastPitch spectrogram generator
from nemo.collections.tts.models import FastPitchModel
spec_generator = FastPitchModel.from_pretrained("nvidia/tts_en_fastpitch")

# Load the HiFi-GAN vocoder
from nemo.collections.tts.models import HifiGanModel
model = HifiGanModel.from_pretrained(model_name="nvidia/tts_hifigan")
```
Advanced Usage
Generate audio and save it to disk:

```python
import soundfile as sf

# Parse the input text and generate a mel spectrogram
parsed = spec_generator.parse("You can type your sentence here to get nemo to produce speech.")
spectrogram = spec_generator.generate_spectrogram(tokens=parsed)

# Convert the spectrogram to an audio waveform with the vocoder
audio = model.convert_spectrogram_to_audio(spec=spectrogram)

# Save the generated audio file (the model operates at 22,050 Hz)
sf.write("speech.wav", audio.to('cpu').detach().numpy()[0], 22050)
```
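If you are working in a notebook, you can also listen to the result inline. Continuing from the example above, this is a small optional addition that assumes IPython is available in your environment:

```python
# Optional: play the generated waveform inline in a Jupyter notebook
# (assumes IPython is installed; not part of the original example)
import IPython.display as ipd

ipd.Audio(audio.to('cpu').detach().numpy()[0], rate=22050)
```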
Input
This model accepts batches of text.
Output
This model generates mel spectrograms.
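To make the input/output contract concrete, here is a minimal sketch that prints the tensor shapes at each stage. The 80-band mel dimension is an assumption based on the typical LJSpeech FastPitch configuration, not something read from this checkpoint:

```python
from nemo.collections.tts.models import FastPitchModel

spec_generator = FastPitchModel.from_pretrained("nvidia/tts_en_fastpitch")

# Input: a batch of tokenized text, shape (batch, num_tokens)
parsed = spec_generator.parse("Hello world.")
print(parsed.shape)

# Output: a batch of mel spectrograms, shape (batch, mel_bands, frames);
# mel_bands is assumed to be 80 for this checkpoint
spectrogram = spec_generator.generate_spectrogram(tokens=parsed)
print(spectrogram.shape)
```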
📚 Documentation
Model Architecture
FastPitch is a fully-parallel text-to-speech model based on FastSpeech, conditioned on fundamental frequency contours. The model predicts pitch contours during inference. By altering these predictions, the generated speech can be made more expressive, better match the semantics of the utterance, and, in the end, be more engaging to the listener. FastPitch is based on a fully-parallel Transformer architecture, with a much higher real-time factor than Tacotron 2 for the mel-spectrogram synthesis of a typical utterance. It uses an unsupervised speech-text aligner.
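As an illustration of prosody control at inference time, the sketch below uses the `pace` argument of `generate_spectrogram`, which uniformly stretches or compresses the predicted phoneme durations. The argument name and availability depend on your NeMo version, so treat this as an assumption to verify rather than a documented guarantee:

```python
from nemo.collections.tts.models import FastPitchModel

spec_generator = FastPitchModel.from_pretrained("nvidia/tts_en_fastpitch")
parsed = spec_generator.parse("Prosody can be adjusted at inference time.")

# pace < 1.0 lengthens phoneme durations (slower, more deliberate speech);
# pace > 1.0 shortens them (faster speech). The argument is assumed to
# exist in this form; check your installed NeMo version.
slow = spec_generator.generate_spectrogram(tokens=parsed, pace=0.8)
fast = spec_generator.generate_spectrogram(tokens=parsed, pace=1.25)
```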
Training
The NeMo toolkit [3] was used to train the models for 1000 epochs. These models were trained with this example script and this base config.
Datasets
This model is trained on LJSpeech sampled at 22,050 Hz, and has been tested on generating female English voices with an American accent.
Performance
No performance information is available at this time.
Limitations
This checkpoint only works well with vocoders that were trained on 22,050 Hz data; otherwise, the generated audio may sound scratchy or choppy.
Deployment with NVIDIA Riva
For the best real-time accuracy, latency, and throughput, deploy the model with NVIDIA Riva, an accelerated speech AI SDK deployable on-prem, in all clouds, multi-cloud, hybrid, at the edge, and embedded.
Additionally, Riva provides:
- World-class out-of-the-box accuracy for the most common languages, with model checkpoints trained on proprietary data with hundreds of thousands of GPU-compute hours
- Best-in-class accuracy with run-time word boosting (e.g., brand and product names) and customization of acoustic model, language model, and inverse text normalization
- Streaming speech recognition, Kubernetes-compatible scaling, and enterprise-grade support
Check out the Riva live demo.
📄 License
This project is licensed under the CC-BY-4.0 license.
🔗 References
[3] NVIDIA NeMo Toolkit: https://github.com/NVIDIA/NeMo