Wav2vec2-base-100h Open-source Speech Recognition Model - Achieve Efficient Automatic Speech Recognition for Free

Wav2vec2 Base 100h

Developed by Facebook

Wav2Vec2 Base 100h is an automatic speech recognition model pre-trained and fine-tuned on 100 hours of LibriSpeech audio sampled at 16kHz.

Speech Recognition

Transformers

English

Open Source License: Apache-2.0

#Low-resource speech recognition #16kHz audio processing #LibriSpeech optimization

Downloads 4,380

Release Time: 3/2/2022

Model Overview

The model learns powerful representations from speech audio alone and is then fine-tuned on transcribed speech, which makes it particularly suitable for scenarios with limited annotated data.

Model Features

Efficient Speech Representation Learning

Learns speech representations by masking the speech input in the latent space and solving a contrastive task defined over a quantization of the latent representations.

Low Annotation Data Requirement

Performs well with limited annotated data: according to the wav2vec 2.0 paper, fine-tuning on just 1 hour of labeled data outperforms the previous state of the art on the 100-hour subset while using 100 times less labeled data.

High Accuracy

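The wav2vec 2.0 paper reports word error rates (WER) of 1.8/3.3 on the LibriSpeech clean/other test sets when all labeled data is used. This 100-hour base model reaches 6.1/13.5.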

Model Capabilities

Speech recognition

Audio-to-text conversion

English speech processing

Use Cases

Speech Transcription

Automatic Meeting Minutes Generation

Automatically converts meeting recordings into text transcripts

Word error rate of 6.1% on the LibriSpeech "clean" test set

Voice Assistants

Used as the speech recognition module for voice assistants

Word error rate of 13.5% on the LibriSpeech "other" test set

Education

Language Learning Applications

Helps language learners practice pronunciation and listening

🚀 Wav2Vec2-Base-100h

A pre-trained and fine-tuned base model for automatic speech recognition on 100 hours of LibriSpeech audio.

This is the base model, pre-trained and fine-tuned on 100 hours of LibriSpeech speech audio sampled at 16kHz. When using this model, ensure that your speech input is also sampled at 16kHz.
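Because the model expects 16kHz input, it can help to check a file's sampling rate before transcribing it. The following is a minimal sketch that uses only Python's standard `wave` module. The names `wav_sample_rate`, `is_16khz_wav`, and `EXPECTED_RATE` are illustrative and not part of the model's API, and the sketch only handles uncompressed WAV files.

```python
import wave

EXPECTED_RATE = 16_000  # Hz, the sampling rate the model expects


def wav_sample_rate(path):
    """Return the sampling rate (in Hz) stored in a WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def is_16khz_wav(path):
    """Hypothetical helper: True if the WAV file is sampled at 16kHz."""
    return wav_sample_rate(path) == EXPECTED_RATE
```

A file recorded at another rate (for example 44.1kHz) would need to be resampled with an audio library before being passed to the processor.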

Facebook's Wav2Vec2 Paper

Authors: Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli

📚 Documentation

Abstract

We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler. wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned. Experiments using all labeled data of Librispeech achieve 1.8/3.3 WER on the clean/other test sets. When lowering the amount of labeled data to one hour, wav2vec 2.0 outperforms the previous state of the art on the 100 hour subset while using 100 times less labeled data. Using just ten minutes of labeled data and pre-training on 53k hours of unlabeled data still achieves 4.8/8.2 WER. This demonstrates the feasibility of speech recognition with limited amounts of labeled data.
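The contrastive task mentioned in the abstract can be illustrated with a small sketch. For a masked time step, the model has to pick out the true quantized latent from a set of distractors, with candidates scored by temperature-scaled cosine similarity. The code below is a simplified pure-Python illustration under those assumptions: the function names, the default temperature of 0.1, and the toy vectors in the usage note are illustrative, and the sketch leaves out everything else in the actual training setup.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(context, true_target, distractors, temperature=0.1):
    """Negative log-probability of the true target among all candidates.

    Illustrative sketch: the softmax is taken over temperature-scaled
    cosine similarities between the context vector and each candidate.
    """
    candidates = [true_target] + list(distractors)
    logits = [cosine_similarity(context, q) / temperature for q in candidates]
    # log-sum-exp, shifted by the maximum logit for numerical stability
    max_logit = max(logits)
    log_denominator = max_logit + math.log(
        sum(math.exp(logit - max_logit) for logit in logits)
    )
    return log_denominator - logits[0]
```

With a context vector `[1, 0]`, a true target `[1, 0]`, and distractors `[[0, 1], [-1, 0]]`, the loss is close to zero. Swapping the true target with a distractor makes it large.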

The original model can be found under https://github.com/pytorch/fairseq/tree/master/examples/wav2vec#wav2vec-20.

💻 Usage Examples

Basic Usage

 from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
 from datasets import load_dataset
 import torch
 
 # load model and processor
 processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-100h")
 model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-100h")
 
 # load dummy dataset; the "audio" column holds the decoded waveform and its sampling rate
 ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
 sample = ds[0]["audio"]
 
 # convert the raw waveform into model inputs (the model expects 16kHz audio)
 input_values = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt", padding="longest").input_values  # Batch size 1
 
 # retrieve logits (no gradients are needed for inference)
 with torch.no_grad():
     logits = model(input_values).logits
 
 # take argmax and decode
 predicted_ids = torch.argmax(logits, dim=-1)
 transcription = processor.batch_decode(predicted_ids)
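
The final argmax and `batch_decode` steps above amount to greedy CTC decoding: take the most likely token for each audio frame, merge consecutive repeats, and drop the blank token. A minimal pure-Python sketch of the collapsing step follows. The function name and the blank id of 0 are assumptions made for illustration; the real processor takes the blank token from its tokenizer and also maps ids back to characters.

```python
def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Collapse per-frame token ids into an output sequence (greedy CTC).

    Consecutive duplicates are merged, then blank tokens are removed.
    A blank between two identical tokens keeps both of them.
    """
    collapsed = []
    previous = None
    for token_id in frame_ids:
        if token_id != previous and token_id != blank_id:
            collapsed.append(token_id)
        previous = token_id
    return collapsed
```

For example, the frame sequence `[0, 1, 1, 0, 2, 2, 2, 0, 0, 2]` collapses to `[1, 2, 2]`: the repeated 1s and 2s merge, and the final 2 survives because a blank separates it from the previous run.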

Advanced Usage

from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
from jiwer import wer


librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")

model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-100h").to("cuda")
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-100h")

# transcribe one example at a time
def map_to_pred(example):
    audio = example["audio"]
    input_values = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt", padding="longest").input_values
    with torch.no_grad():
        logits = model(input_values.to("cuda")).logits

    predicted_ids = torch.argmax(logits, dim=-1)
    # batch_decode returns a list; keep the single transcription string
    example["transcription"] = processor.batch_decode(predicted_ids)[0]
    return example

# drop the raw audio column once each example has been transcribed
result = librispeech_eval.map(map_to_pred, remove_columns=["audio"])

print("WER:", wer(result["text"], result["transcription"]))

📊 Evaluation Results

LibriSpeech test set	WER (%)
"clean"	6.1
"other"	13.5
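
For reference, word error rate is the word-level edit distance (substitutions, deletions, and insertions) between the hypothesis and the reference, divided by the number of reference words. The sketch below is a minimal illustration of that definition. The function name is illustrative, and it applies no text normalization, so its output can differ from a library such as `jiwer` when casing or punctuation differ.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j] is the edit distance between the reference words
    # processed so far and the first j hypothesis words
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1] / len(ref_words)
```

Multiplying the returned fraction by 100 gives the percentage scale used in the table above, so a value of 6.1 corresponds to roughly 6 word errors per 100 reference words.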

📄 License

This project is licensed under the Apache 2.0 license.
