The open-source wav2vec2-base-960h model: Power your English automatic speech recognition tasks for free

Wav2vec2 Base 960h

Developed by facebook

The Wav2Vec2 base model developed by Facebook, pre-trained and fine-tuned on 960 hours of LibriSpeech audio for English automatic speech recognition tasks.

Speech Recognition

Transformers

Language: English
Open Source License: Apache-2.0
Tags: High-precision speech recognition, English speech-to-text, Low-resource adaptation

Downloads 2.1M

Release Time : 3/2/2022

Model Overview

This model is an automatic speech recognition (ASR) model capable of converting English speech into text. It is pre-trained and fine-tuned on the LibriSpeech dataset and supports audio input with a 16kHz sampling rate.

Model Features

Efficient speech recognition

Achieves a 3.4% word error rate (WER) on the LibriSpeech "clean" test set and 8.6% on the "other" test set.

High performance with limited labeled data

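According to the wav2vec 2.0 paper, fine-tuning on only ten minutes of labeled data after pre-training on 53k hours of unlabeled data still achieves 4.8/8.2 WER on the clean/other test sets. That figure describes the wav2vec 2.0 approach in general; this checkpoint was fine-tuned on the full 960 hours of LibriSpeech.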

16kHz sampling rate support

The model is optimized for audio with a 16kHz sampling rate. Ensure input audio meets this specification when using the model.
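Audio recorded at another rate has to be resampled before it is passed to the model. The following is a minimal sketch of one way to do that, assuming NumPy and SciPy are installed and the audio is already loaded as a mono NumPy array. The helper name to_16khz and the synthetic test tone are illustrative and not part of the model card.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

TARGET_SR = 16_000  # sampling rate the model expects, per the model card


def to_16khz(waveform: np.ndarray, orig_sr: int) -> np.ndarray:
    """Resample a mono waveform to 16 kHz (hypothetical helper)."""
    if orig_sr == TARGET_SR:
        return waveform.astype(np.float32)
    # Reduce TARGET_SR / orig_sr to lowest terms, e.g. 44100 Hz -> 160 / 441.
    g = gcd(TARGET_SR, orig_sr)
    # Polyphase resampling applies a low-pass filter, which limits aliasing.
    resampled = resample_poly(waveform, TARGET_SR // g, orig_sr // g)
    return resampled.astype(np.float32)


# One second of a 440 Hz tone at 44.1 kHz stands in for real audio here.
t = np.arange(44_100) / 44_100
tone = np.sin(2 * np.pi * 440 * t)
resampled = to_16khz(tone, orig_sr=44_100)
print(resampled.shape)
```

The resampled array can then be handed to the processor together with sampling_rate=16000.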

Model Capabilities

English speech recognition

Audio-to-text conversion

Automatic speech transcription

Use Cases

Speech transcription

Meeting minutes

Automatically convert meeting recordings into text transcripts

Highly accurate transcription results

Podcast transcription

Convert English podcast content into searchable text

Facilitates content retrieval and analysis

Assistive technology

Voice input system

Provides speech-to-text functionality for people with disabilities

Improves accessibility

🚀 Wav2Vec2-Base-960h

A base model pre-trained and fine-tuned on 960 hours of LibriSpeech for automatic speech recognition

This is a base model that has been pre-trained and fine-tuned on 960 hours of LibriSpeech, using speech audio sampled at 16 kHz. Ensure that your speech input is also sampled at 16 kHz when using the model.

Authors: Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli

Abstract

For the first time, we demonstrate that learning powerful representations from speech audio alone and then fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler. wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned. Experiments using all labeled data of Librispeech achieve 1.8/3.3 WER on the clean/other test sets. When reducing the amount of labeled data to one hour, wav2vec 2.0 outperforms the previous state-of-the-art on the 100-hour subset while using 100 times less labeled data. Using just ten minutes of labeled data and pre-training on 53k hours of unlabeled data still achieves 4.8/8.2 WER. This shows the feasibility of speech recognition with limited amounts of labeled data.
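For readers who want the contrastive task from the abstract in symbols, the wav2vec 2.0 paper defines it roughly as below. This is a sketch in the paper's notation and is not stated in this model card: c_t is the context network output at a masked time step t, q_t is the true quantized latent for that step, Q_t is the candidate set containing q_t plus sampled distractors, sim is cosine similarity, and κ is a temperature. The paper combines this term with a codebook diversity term, which is omitted here.

```latex
\mathcal{L}_m = -\log
  \frac{\exp\!\big(\mathrm{sim}(\mathbf{c}_t, \mathbf{q}_t)/\kappa\big)}
       {\sum_{\tilde{\mathbf{q}} \in \mathbf{Q}_t}
        \exp\!\big(\mathrm{sim}(\mathbf{c}_t, \tilde{\mathbf{q}})/\kappa\big)},
\qquad
\mathrm{sim}(\mathbf{a}, \mathbf{b}) =
  \frac{\mathbf{a}^{\top}\mathbf{b}}{\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert}
```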

The original model can be found under https://github.com/pytorch/fairseq/tree/master/examples/wav2vec#wav2vec-20.

🚀 Quick Start

✨ Features

Datasets: Utilizes the librispeech_asr dataset.
Tags: audio, automatic-speech-recognition, hf-asr-leaderboard.
License: Licensed under the Apache 2.0 license.

📦 Installation

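The original model card gives no installation steps. The usage examples below import the transformers, datasets, and torch packages, and the evaluation example also imports jiwer, so these need to be installed in your Python environment.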

💻 Usage Examples

Basic Usage

To transcribe audio files, the model can be used as a standalone acoustic model as follows:

from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from datasets import load_dataset
import torch

# load model and processor (feature extractor + tokenizer)
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# load dummy dataset and read soundfiles
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

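# extract input features; pass the sampling rate so the processor can check that it is 16 kHz
audio = ds[0]["audio"]
input_values = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt", padding="longest").input_values  # Batch size 1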

# retrieve logits
logits = model(input_values).logits

# take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
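The last step above (argmax over the logits, then batch_decode) amounts to greedy CTC decoding. The sketch below shows that collapse rule on a toy sequence of per-frame tokens in plain Python. It assumes the blank token is "<pad>" and the word delimiter is "|", which is how this checkpoint's tokenizer is commonly configured. The function name ctc_greedy_collapse is illustrative and is not the library's implementation.

```python
from itertools import groupby

BLANK = "<pad>"       # assumed CTC blank token
WORD_DELIMITER = "|"  # assumed word-boundary token


def ctc_greedy_collapse(tokens):
    """Collapse a per-frame token sequence into text, CTC-style."""
    # 1. Merge each run of identical consecutive tokens into a single token.
    merged = [tok for tok, _ in groupby(tokens)]
    # 2. Drop blanks; a blank between two equal letters keeps both letters.
    kept = [tok for tok in merged if tok != BLANK]
    # 3. Turn word delimiters into spaces and trim the ends.
    return "".join(" " if tok == WORD_DELIMITER else tok for tok in kept).strip()


frames = ["H", "H", "<pad>", "E", "L", "L", "<pad>", "L", "O", "|", "|", "H", "I", "<pad>"]
decoded = ctc_greedy_collapse(frames)
print(decoded)  # HELLO HI
```

Note how the blank between the two runs of "L" is what preserves the double letter in "HELLO".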

Advanced Usage

This code snippet shows how to evaluate facebook/wav2vec2-base-960h on LibriSpeech's "clean" test data. To evaluate on the "other" test data, pass "other" instead of "clean" to load_dataset.

from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
from jiwer import wer

librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")

model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h").to("cuda")
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")

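def map_to_pred(batch):
    # map() is called without batched=True, so `batch` holds a single example
    audio = batch["audio"]
    input_values = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt", padding="longest").input_values
    with torch.no_grad():
        logits = model(input_values.to("cuda")).logits

    predicted_ids = torch.argmax(logits, dim=-1)
    # batch_decode returns a list; keep the single transcription string
    batch["transcription"] = processor.batch_decode(predicted_ids)[0]
    return batch

# process one example at a time; with batched=True, batch["audio"] would be a list and the indexing above would fail
result = librispeech_eval.map(map_to_pred, remove_columns=["audio"])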

print("WER:", wer(result["text"], result["transcription"]))

Result (WER):

"clean"	"other"
3.4	8.6
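The WER figures above are word-level edit distances (substitutions, deletions, and insertions) divided by the number of reference words, expressed as percentages. As a rough illustration of the metric for a single sentence pair, here is a self-contained sketch in plain Python. The function name word_error_rate and the example sentences are made up for illustration; the evaluation above relies on jiwer rather than this code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


reference = "THE CAT SAT ON THE MAT"
hypothesis = "THE CAT SIT ON MAT"  # one substitution, one deletion
score = word_error_rate(reference, hypothesis)
print(f"WER: {score:.3f}")  # WER: 0.333
```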

📚 Documentation

Model details:

Property	Details
Model Type	Wav2Vec2-Base-960h
Training Data	LibriSpeech (960 hours of speech audio sampled at 16 kHz)

📄 License

This project is licensed under the Apache 2.0 license.
