Wav2vec2-large-robust-ft-libritts-voxpopuli Open-source Speech Recognition Model - Generate Punctuated Transcription Texts to Aid TTS Construction

Wav2vec2 Large Robust Ft Libritts Voxpopuli

Developed by jbetker

A speech recognition model based on wav2vec2-large, specifically designed to generate transcribed text with punctuation, suitable for TTS model construction.

Speech Recognition

Transformers

#TTS prosody optimization #Punctuation transcription #Clean audio adaptation

Downloads 339.01k

Release Time : 3/2/2022

Model Overview

This model fine-tunes the facebook/wav2vec2-large-robust-ft-libri-960h checkpoint on libritts and voxpopuli with a new vocabulary that includes punctuation. It targets punctuated transcripts, which suit TTS applications where prosody matters.

Model Features

Punctuation generation

Designed to generate transcribed text with punctuation, crucial for the prosody performance of TTS models.

High accuracy

Achieves a 4.45% word error rate (WER) on the librispeech validation set, close to the baseline model's 4.3%.

Clean audio optimization

Fine-tuned on clean audio datasets like libritts and voxpopuli, suitable for high-quality audio transcription.

Model Capabilities

Speech-to-text

Punctuation insertion

High-quality audio transcription

Use Cases

Text-to-speech (TTS)

TTS model transcription construction

Generates transcribed text with punctuation for TTS models to enhance prosody performance.

Improves the naturalness and expressiveness of TTS output.

Speech transcription

High-quality audio transcription

Suitable for transcription tasks on clean audio like libritts.

Reaches a 4.45% word error rate (WER) on the librispeech validation set.

🚀 Wav2Vec2-Large Model for Transcriptions with Punctuation

This checkpoint is a wav2vec2-large model designed to generate transcriptions with punctuation. It's particularly useful for building transcriptions for TTS models, where punctuation plays a crucial role in prosody.

🚀 Quick Start

This wav2vec2-large model is fine-tuned to generate punctuated transcriptions, which are essential for TTS models to achieve better prosody.

✨ Features

Punctuation Generation: Capable of generating transcriptions with punctuation, which is beneficial for TTS prosody.
Fine-Tuned on Specific Datasets: Fine-tuned on the libritts and voxpopuli datasets with a new vocabulary that includes punctuation.
Respectable WER: Achieves a WER of 4.45% on the librispeech validation set, comparable to the baseline.

📦 Installation

No specific installation steps are provided in the original document.

💻 Usage Examples

For usage examples, see ocotillo, the speech transcription script repo: https://github.com/neonbjb/ocotillo
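For readers who want a starting point with the Hugging Face Transformers library directly, the following is a minimal sketch of greedy transcription, not the method used in ocotillo. It rests on assumptions that should be checked on the hub: the feature extractor is borrowed from the base checkpoint named in this card, the tokenizer id is the vocabulary name exactly as this card writes it, and the input is mono audio at 16 kHz (the rate wav2vec2 models expect). The helper names `check_audio` and `transcribe` are hypothetical.

```python
# Ids taken from this card; verify them on the model hub before relying on them.
MODEL_ID = "jbetker/wav2vec2-large-robust-ft-libritts-voxpopuli"
# Assumption: reuse the base checkpoint's audio preprocessing config.
FEATURE_EXTRACTOR_ID = "facebook/wav2vec2-large-robust-ft-libri-960h"
# Assumption: vocabulary repo name as written in this card; the exact hub id may differ.
VOCAB_ID = "jbetker/tacotron_symbols"
TARGET_SR = 16_000  # wav2vec2 models expect 16 kHz input


def check_audio(samples, sampling_rate):
    """Reject input that would silently degrade recognition."""
    if sampling_rate != TARGET_SR:
        raise ValueError(
            f"expected {TARGET_SR} Hz audio, got {sampling_rate} Hz; resample first"
        )
    if len(samples) == 0:
        raise ValueError("empty audio")


def transcribe(samples, sampling_rate, model_id=MODEL_ID,
               feature_extractor_id=FEATURE_EXTRACTOR_ID, vocab_id=VOCAB_ID):
    """Greedy CTC transcription of a 1-D sequence of float samples."""
    # Heavy dependencies are imported lazily so the helpers above stay importable.
    import torch
    from transformers import (
        Wav2Vec2CTCTokenizer,
        Wav2Vec2FeatureExtractor,
        Wav2Vec2ForCTC,
    )

    check_audio(samples, sampling_rate)
    feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(feature_extractor_id)
    tokenizer = Wav2Vec2CTCTokenizer.from_pretrained(vocab_id)
    model = Wav2Vec2ForCTC.from_pretrained(model_id).eval()

    inputs = feature_extractor(
        samples, sampling_rate=sampling_rate, return_tensors="pt"
    )
    with torch.no_grad():
        logits = model(**inputs).logits
    # Pick the most likely symbol per frame; the tokenizer collapses
    # repeats and removes CTC blanks when decoding.
    predicted_ids = torch.argmax(logits, dim=-1)
    return tokenizer.batch_decode(predicted_ids)[0]
```

Calling `transcribe(samples, 16000)` with a list or NumPy array of float samples would download the checkpoint on first use. Audio at other sample rates needs resampling beforehand, which this sketch leaves to the caller.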

📚 Documentation

Model Creation

This model was created by fine-tuning the facebook/wav2vec2-large-robust-ft-libri-960h checkpoint on the libritts and voxpopuli datasets. A new vocabulary that includes punctuation was used during the fine-tuning process.
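To illustrate why the vocabulary matters, here is a small sketch of greedy CTC decoding, the usual way wav2vec2 outputs are turned into text. A CTC model can only emit symbols that exist in its vocabulary, so punctuation has to be part of the symbol set before the model can produce it. The toy vocabulary and frame sequence below are hypothetical and are not the model's real symbol table.

```python
from itertools import groupby

# Hypothetical toy vocabulary; the real symbol set is the one published with the model.
BLANK = "<pad>"
VOCAB = [BLANK, " ", "h", "i", ",", "y", "o", "u", "!"]


def ctc_greedy_decode(frame_ids, vocab=VOCAB, blank=BLANK):
    """Collapse consecutive repeated labels, then drop CTC blanks."""
    collapsed = [label for label, _ in groupby(frame_ids)]
    return "".join(vocab[i] for i in collapsed if vocab[i] != blank)


# One predicted label per audio frame (made-up values for illustration).
frames = [2, 2, 0, 3, 4, 4, 1, 5, 5, 6, 0, 7, 8]
print(ctc_greedy_decode(frames))  # prints: hi, you!
```

With a letters-only vocabulary, the indices for "," and "!" would not exist, and the same audio could only be decoded as unpunctuated text.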

Performance

The model gets a WER of 4.45% on the librispeech validation set, while the baseline facebook/wav2vec2-large-robust-ft-libri-960h got 4.3%.
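For context on how a figure like this is computed, WER is the word-level edit distance between reference and hypothesis (substitutions, deletions, and insertions) divided by the number of reference words. The sketch below is a plain-Python illustration. The card does not say which scoring tool or text normalization produced the numbers above, so the function name and the whitespace tokenization are assumptions.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
# Punctuation attached to a word counts as a mismatch unless it is normalized away.
print(word_error_rate("hello world", "hello, world"))
```

The second call shows that punctuated output scored against unpunctuated references registers as word errors under naive tokenization, so punctuation handling needs to be decided when comparing WER figures.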

Limitation

Because the model was fine-tuned on clean audio, it is not well-suited to noisy audio such as CommonVoice, although it still performs reasonably well there.

Vocabulary

The vocabulary is uploaded to the model hub as jbetker/tacotron_symbols.

🔧 Technical Details

The model uses the wav2vec2-large architecture. It was fine-tuned on libritts and voxpopuli with a custom vocabulary so that it can emit punctuation. Its WER on the librispeech validation set stays close to the baseline's (4.45% versus 4.3%), which suggests the added punctuation vocabulary cost little word-level accuracy.

📄 License

No license information is provided in the original document.
