# Wav2Vec2-Large-Robust finetuned on Librispeech
This model is a fine-tuned version of the wav2vec2-large-robust model for automatic speech recognition. It was pretrained on multiple speech datasets and fine-tuned on Librispeech.
## 🚀 Quick Start
This model is a fine-tuned version of the [wav2vec2-large-robust](https://huggingface.co/facebook/wav2vec2-large-robust) model. It has been pretrained on several datasets:
- [Libri-Light](https://github.com/facebookresearch/libri-light): open-source audio books from the LibriVox project; clean, read-out audio data
- CommonVoice: crowd-sourced audio data; read-out text snippets
- Switchboard: telephone speech corpus; noisy telephone data
- Fisher: conversational telephone speech; noisy telephone data

It was subsequently fine-tuned on 960 hours of Librispeech, an open-source corpus of read-out audio data.
When using the model, make sure that your speech input is also sampled at 16kHz.
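For a quick end-to-end test, the `transformers` pipeline API can run the model directly. This is a minimal sketch; `"sample.wav"` is a placeholder path and the file is assumed to be 16kHz mono audio:

```python
from transformers import pipeline

# load the model into an ASR pipeline (weights are downloaded on first use)
asr = pipeline(
    "automatic-speech-recognition",
    model="facebook/wav2vec2-large-robust-ft-libri-960h",
)

# "sample.wav" is a placeholder; replace it with your own 16 kHz mono audio file
print(asr("sample.wav"))
```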
The related paper is [Robust Wav2Vec2](https://arxiv.org/abs/2104.01027).

Authors: Wei-Ning Hsu, Anuroop Sriram, Alexei Baevski, Tatiana Likhomanenko, Qiantong Xu, Vineel Pratap, Jacob Kahn, Ann Lee, Ronan Collobert, Gabriel Synnaeve, Michael Auli
### Abstract
Self-supervised learning of speech representations has been a very active research area but most work is focused on a single domain such as read audio books for which there exist large quantities of labeled and unlabeled data. In this paper, we explore more general setups where the domain of the unlabeled data for pre-training differs from the domain of the labeled data for fine-tuning, which in turn may differ from the test data domain. Our experiments show that using target domain data during pre-training leads to large performance improvements across a variety of setups. On a large-scale competitive setup, we show that pre-training on unlabeled in-domain data reduces the gap between models trained on in-domain and out-of-domain labeled data by 66%-73%. This has obvious practical implications since it is much easier to obtain unlabeled target domain data than labeled data. Moreover, we find that pre-training on multiple domains improves generalization performance on domains not seen during training. Code and models will be made available at this https URL.
The original model can be found at https://github.com/pytorch/fairseq/tree/master/examples/wav2vec#wav2vec-20.
## ✨ Features
- Multi-domain pretraining: pretrained on various datasets including Libri-Light, CommonVoice, Switchboard, and Fisher, which helps the model handle different types of audio data.
- Fine-tuning on Librispeech: fine-tuned on 960 hours of Librispeech data, improving its performance on open-source read-out audio data.
## 📚 Documentation
### Datasets

| Property | Details |
|----------|---------|
| Model Type | Wav2Vec2-Large-Robust finetuned on Librispeech |
| Training Data | Pretrained on Libri-Light, CommonVoice, Switchboard, and Fisher; fine-tuned on Librispeech |
### ⚠️ Important Note

When using the model, make sure that your speech input is also sampled at 16kHz.
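If your audio was recorded at a different rate, resample it before passing it to the model. Below is a minimal sketch using torchaudio; the file name is a placeholder:

```python
import torchaudio

# load an audio file at its native sampling rate ("speech.wav" is a placeholder)
waveform, sample_rate = torchaudio.load("speech.wav")

# resample to the 16 kHz rate the model expects, if necessary
if sample_rate != 16_000:
    waveform = torchaudio.functional.resample(
        waveform, orig_freq=sample_rate, new_freq=16_000
    )
```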
## 💻 Usage Examples
### Basic Usage
```python
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from datasets import load_dataset
import soundfile as sf
import torch

# load the processor (feature extractor + tokenizer) and the fine-tuned model
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-robust-ft-libri-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-robust-ft-libri-960h")

# read a sound file into a float array
def map_to_array(batch):
    speech, _ = sf.read(batch["file"])
    batch["speech"] = speech
    return batch

# load a dummy Librispeech split and decode its audio files
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
ds = ds.map(map_to_array)

# tokenize two samples, padding to the longest sequence in the batch
input_values = processor(
    ds["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding="longest"
).input_values

# retrieve logits over the character vocabulary
logits = model(input_values).logits

# greedy decoding: take the most likely token per frame, then collapse with the CTC tokenizer
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
```
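To sanity-check the output, you can print the transcriptions and, if reference texts are available, compute the word error rate. A sketch using the `jiwer` package (an assumption; any WER implementation works), assuming the dummy split keeps Librispeech's `text` column:

```python
import jiwer

# reference transcriptions for the two decoded samples
references = ds["text"][:2]

print(transcription)
print("WER:", jiwer.wer(references, transcription))
```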
## 📄 License
This model is licensed under the [apache-2.0](https://www.apache.org/licenses/LICENSE-2.0) license.