The open-source wav2vec2-large-960h-lv60-self-4-gram speech recognition model

Wav2vec2 Large 960h Lv60 Self 4 Gram

Developed by patrickvonplaten

Based on Facebook's Wav2Vec2-Large-960h-lv60-self model, enhanced with an English 4-gram language model to improve speech recognition accuracy

Task: Speech Recognition
Language: English
License: Apache-2.0 (open source)
Tags: #High-precision speech recognition #English speech transcription #4-gram language model

Release Time: 4/12/2022

Model Overview

This is an automatic speech recognition (ASR) model for English speech-to-text tasks. It integrates a 4-gram language model to improve recognition accuracy.

Model Features

4-gram language model integration

Incorporates the official LibriSpeech 4-gram language model to improve speech recognition accuracy

High-performance recognition

Achieves word error rates (WER) of 1.84% (clean) and 3.71% (other) on the LibriSpeech test sets

Based on Wav2Vec2 architecture

Builds on Facebook's Wav2Vec2-Large-960h-lv60-self checkpoint

Model Capabilities

English speech recognition

High-accuracy speech-to-text conversion

Processes audio sampled at 16 kHz
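
Because the model works on 16 kHz audio, recordings made at other sampling rates need to be resampled first. The following is a minimal pure-Python sketch using linear interpolation; the function name resample_linear is illustrative and not part of any library, and real pipelines would normally use a dedicated resampler with anti-aliasing filtering instead.

```python
def resample_linear(samples, src_rate, dst_rate=16_000):
    """Resample a mono signal by linear interpolation (no anti-aliasing filter)."""
    if src_rate <= 0 or dst_rate <= 0:
        raise ValueError("sampling rates must be positive")
    if src_rate == dst_rate or len(samples) < 2:
        return list(samples)

    n_out = int(len(samples) * dst_rate / src_rate)
    step = src_rate / dst_rate  # source samples advanced per output sample
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * step
        left = min(int(pos), last)
        right = min(left + 1, last)
        frac = pos - left
        # Blend the two neighbouring source samples.
        out.append(samples[left] * (1.0 - frac) + samples[right] * frac)
    return out


if __name__ == "__main__":
    # One second of (silent) 48 kHz audio becomes 16,000 samples.
    one_second = [0.0] * 48_000
    print(len(resample_linear(one_second, 48_000)))
```

The resampled list can then be passed to the processor with sampling_rate=16_000, as in the usage example.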

Use Cases

Speech transcription

Audiobook transcription

Automatically transcribes English audiobook content into text

Achieves a word error rate of 1.84% on the LibriSpeech "clean" test set

Meeting minutes

Automatically records English meeting content and generates transcripts

Reference point: a word error rate of 3.71% on the more challenging LibriSpeech "other" test set (read speech, not meeting recordings)

🚀 Wav2Vec2-Large-960h-lv60-self + 4-gram

This model is a version of Facebook's Wav2Vec2-Large-960h-lv60-self augmented with an English 4-gram language model. The language model is the 4-gram.arpa.gz file from LibriSpeech's official n-grams.
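
To illustrate why an n-gram language model can help, the sketch below rescores two acoustically similar candidate transcriptions with a toy bigram model. All probabilities, the lm_weight parameter, and the function names are made up for illustration. The real decoder applies the LibriSpeech 4-gram inside a beam search, which this sketch does not reproduce.

```python
import math

# Toy bigram log-probabilities (made-up numbers, for illustration only).
BIGRAM_LOGPROB = {
    ("<s>", "recognize"): math.log(0.2),
    ("recognize", "speech"): math.log(0.5),
    ("<s>", "wreck"): math.log(0.01),
    ("wreck", "a"): math.log(0.3),
    ("a", "nice"): math.log(0.2),
    ("nice", "beach"): math.log(0.1),
}
UNSEEN_LOGPROB = math.log(1e-4)  # crude fallback for bigrams not in the table


def lm_logprob(words):
    """Sum bigram log-probabilities over a word sequence."""
    total = 0.0
    prev = "<s>"  # sentence-start marker
    for word in words:
        total += BIGRAM_LOGPROB.get((prev, word), UNSEEN_LOGPROB)
        prev = word
    return total


def rescore(candidates, lm_weight=0.5):
    """Return the text maximizing acoustic score + lm_weight * LM score."""
    def combined(candidate):
        text, acoustic_logprob = candidate
        return acoustic_logprob + lm_weight * lm_logprob(text.split())
    return max(candidates, key=combined)[0]


# (text, acoustic log-probability) pairs; the first one sounds slightly better.
candidates = [
    ("wreck a nice beach", -4.0),
    ("recognize speech", -4.2),
]

if __name__ == "__main__":
    print(rescore(candidates, lm_weight=0.0))  # acoustic score only
    print(rescore(candidates, lm_weight=0.5))  # acoustic score plus language model
```

With the language model weight set to zero, the acoustically preferred candidate wins; with a positive weight, the candidate that the language model finds more plausible is selected.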

✨ Features

Audio Processing: Specialized in audio tasks, particularly automatic speech recognition.
Enhanced with a 4-gram language model: Incorporates an English 4-gram language model for improved performance.
High Performance: Achieves a low word error rate (WER) on the LibriSpeech test sets.

📦 Installation

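The original model card lists no installation steps. The usage example imports the datasets, transformers, torch, and jiwer Python packages; decoding with the n-gram language model through transformers additionally relies on pyctcdecode and kenlm.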

💻 Usage Examples

Basic Usage

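from datasets import load_dataset
from transformers import AutoModelForCTC, AutoProcessor
import torch
from jiwer import wer

model_id = "patrickvonplaten/wav2vec2-large-960h-lv60-self-4-gram"

# Use the GPU when one is available; fall back to the CPU otherwise.
device = "cuda" if torch.cuda.is_available() else "cpu"

# LibriSpeech "other" test split; pass "clean" instead to evaluate the clean split.
librispeech_eval = load_dataset("librispeech_asr", "other", split="test")

model = AutoModelForCTC.from_pretrained(model_id).to(device)
# The processor handles feature extraction and language-model decoding.
processor = AutoProcessor.from_pretrained(model_id)

def map_to_pred(batch):
    # Convert the raw 16 kHz waveform into model inputs.
    inputs = processor(batch["audio"]["array"], sampling_rate=16_000, return_tensors="pt")

    inputs = {k: v.to(device) for k, v in inputs.items()}

    with torch.no_grad():
        logits = model(**inputs).logits

    # Decode the logits and keep the transcription of the single example in the batch.
    transcription = processor.batch_decode(logits.cpu().numpy()).text[0]
    batch["transcription"] = transcription
    return batch

result = librispeech_eval.map(map_to_pred, remove_columns=["audio"])

# Word error rate of the predictions against the reference transcripts.
print(wer(result["text"], result["transcription"]))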

Advanced Usage

The Basic Usage code can be adapted to different requirements, for example by loading a different dataset (with audio sampled at 16 kHz) or by adjusting model parameters.

📚 Documentation

Evaluation

The Basic Usage code above evaluates patrickvonplaten/wav2vec2-large-960h-lv60-self-4-gram on LibriSpeech's "other" test data. To evaluate on the "clean" test data, pass "clean" instead of "other" to load_dataset.

Results

Word error rate (WER, %) on the LibriSpeech test sets:

"clean"	"other"
1.84	3.71
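
For reference, WER is the word-level edit distance (substitutions, deletions, and insertions) between a hypothesis and its reference, divided by the number of reference words. The following is a minimal pure-Python sketch of that definition, not the jiwer implementation; the function name word_error_rate is illustrative. It returns a fraction, so the percentages reported above correspond to values of about 0.0184 and 0.0371.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the") over six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```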

🔧 Technical Details

Model Type: Based on the Wav2Vec2 architecture, enhanced with an English 4-gram language model.
Training Data: Utilizes the LibriSpeech ASR dataset.

📄 License

This project is licensed under the Apache-2.0 license.
