🚀 Wav2Vec2-Base-960h
This repository reimplements Facebook's official wav2vec 2.0, aiming to provide a clear way to convert the pretrained model to a pytorch_model.bin file.
🚀 Quick Start
This repository is a re-implementation of Facebook's official wav2vec 2.0. The official release does not document how to convert the pretrained model to a pytorch_model.bin file, so we rebuild pytorch_model.bin from the pretrained checkpoint. Here is the conversion method.
pip install transformers[sentencepiece]
pip install fairseq -U
git clone https://github.com/huggingface/transformers.git
cp transformers/src/transformers/models/wav2vec2/convert_wav2vec2_original_pytorch_checkpoint_to_pytorch.py .
wget https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_small_960h.pt -O ./wav2vec_small_960h.pt
mkdir dict
wget https://dl.fbaipublicfiles.com/fairseq/wav2vec/dict.ltr.txt -O ./dict/dict.ltr.txt
mkdir outputs
python convert_wav2vec2_original_pytorch_checkpoint_to_pytorch.py --pytorch_dump_folder_path ./outputs --checkpoint_path ./wav2vec_small_960h.pt --dict_path ./dict/dict.ltr.txt
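Once the script finishes, ./outputs should contain the converted weights (pytorch_model.bin) together with the model config. A minimal sketch to sanity-check the converted checkpoint; the silent dummy input is a placeholder, not real speech:

import torch
from transformers import Wav2Vec2ForCTC

# Load the freshly converted checkpoint from the dump folder
model = Wav2Vec2ForCTC.from_pretrained("./outputs")
model.eval()

# One second of silence at 16 kHz as a smoke test
dummy = torch.zeros(1, 16000)
with torch.no_grad():
    logits = model(dummy).logits
print(logits.shape)  # (batch, time_steps, vocab_size)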
✨ Features
- Dataset Support: Supports the librispeech_asr dataset.
- Widget Examples: Provides two audio examples from LibriSpeech for quick testing.
- License: Licensed under the apache-2.0 license.
📦 Installation
The installation steps are included in the conversion method above: install the transformers and fairseq libraries, and clone the transformers repository to obtain the conversion script.
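As a quick sanity check that the environment is ready, the installed packages can be imported and their versions printed (a minimal sketch; exact versions will vary):

import torch
import fairseq
import transformers

# Print installed versions to confirm the environment is ready
print("transformers:", transformers.__version__)
print("fairseq:", fairseq.__version__)
print("torch:", torch.__version__)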
💻 Usage Examples
Basic Usage
To transcribe audio files, the model can be used as a standalone acoustic model as follows:
from transformers import Wav2Vec2Tokenizer, Wav2Vec2ForCTC
from datasets import load_dataset
import soundfile as sf
import torch

# Load the tokenizer and the fine-tuned CTC model
tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# Read each audio file into a float array
def map_to_array(batch):
    speech, _ = sf.read(batch["file"])
    batch["speech"] = speech
    return batch

ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
ds = ds.map(map_to_array)

# Tokenize the first two samples, run the model, and greedily decode the logits
input_values = tokenizer(ds["speech"][:2], return_tensors="pt", padding="longest").input_values
logits = model(input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)
transcription = tokenizer.batch_decode(predicted_ids)
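The same model can also transcribe a local recording. A minimal sketch, assuming a 16 kHz mono WAV file; audio.wav is a placeholder path:

import soundfile as sf
import torch
from transformers import Wav2Vec2Tokenizer, Wav2Vec2ForCTC

tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# Read a local recording (placeholder path); the model expects 16 kHz input
speech, sample_rate = sf.read("audio.wav")
assert sample_rate == 16000, "resample the audio to 16 kHz first"

input_values = tokenizer(speech, return_tensors="pt", padding="longest").input_values
with torch.no_grad():
    logits = model(input_values).logits
transcription = tokenizer.batch_decode(torch.argmax(logits, dim=-1))
print(transcription)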
Advanced Usage
This code snippet shows how to evaluate facebook/wav2vec2-base-960h on LibriSpeech's "clean" and "other" test data.
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer
import soundfile as sf
import torch
from jiwer import wer

librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")

model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h").to("cuda")
tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")

# Read each audio file into a float array
def map_to_array(batch):
    speech, _ = sf.read(batch["file"])
    batch["speech"] = speech
    return batch

librispeech_eval = librispeech_eval.map(map_to_array)

# Transcribe one batch at a time and store the predictions
def map_to_pred(batch):
    input_values = tokenizer(batch["speech"], return_tensors="pt", padding="longest").input_values
    with torch.no_grad():
        logits = model(input_values.to("cuda")).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = tokenizer.batch_decode(predicted_ids)
    batch["transcription"] = transcription
    return batch

result = librispeech_eval.map(map_to_pred, batched=True, batch_size=1, remove_columns=["speech"])

print("WER:", wer(result["text"], result["transcription"]))
Result (WER): 3.4 on test "clean" and 8.6 on test "other".
📄 License
This project is licensed under the apache-2.0 license.