Open-source English speech recognition model wav2vec2-large-xlsr-53-english

Wav2Vec2 Large XLSR-53 English

Developed by jonatasgrosman

An English speech recognition model fine-tuned from the facebook/wav2vec2-large-xlsr-53 model, trained on the Common Voice 6.1 dataset

Speech Recognition | English | Open Source License: Apache-2.0 | #English speech recognition #XLSR fine-tuning #Low word error rate

Downloads 251.78k

Release Time: 3/2/2022

Model Overview

This is a fine-tuned XLSR-53 large model for English speech recognition tasks, capable of converting English speech to text

Model Features

High-performance English speech recognition

Achieves 19.06% word error rate and 7.69% character error rate on the Common Voice test set

Language model enhancement support

With a language model, the word error rate can be reduced to 14.81% and character error rate to 6.84%

16kHz sampling rate support

Optimized for 16kHz sampled speech input

Based on XLSR-53 pre-trained model

Leverages the advantages of large-scale cross-lingual speech representation (XLSR) pre-training

Model Capabilities

English speech recognition

Speech-to-text conversion

Supports long audio processing (via chunking)

Use Cases

Speech transcription

Automatic meeting transcription

Automatically converts English meeting recordings into text transcripts

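Approximately 80.94% word accuracy, taken as 100% minus the 19.06% WER on the Common Voice test set; accuracy on actual meeting recordings may differ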

Voice note conversion

Converts personal voice memos into searchable text

Assistive technology

Real-time caption generation

Generates real-time captions for English videos or live streams

🚀 XLSR Wav2Vec2 English by Jonatas Grosman

This project presents a fine-tuned XLSR-53 large model for English speech recognition. It addresses the need for accurate automatic speech recognition in English by taking a pre-trained model and fine-tuning it on English speech data.

🚀 Quick Start

This model is a fine-tuned version of facebook/wav2vec2-large-xlsr-53 on English, using the train and validation splits of Common Voice 6.1. When using this model, make sure that your speech input is sampled at 16kHz.

This model was fine-tuned thanks to GPU credits generously provided by OVHcloud.
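The 16kHz requirement can be checked before any audio reaches the model. The following is a minimal sketch that reads the sample rate from a PCM WAV header using only the Python standard library. The helper names `wav_sample_rate` and `needs_resampling` are hypothetical and not part of any library mentioned in this document, and other formats such as MP3 would need a different reader.

```python
import wave

TARGET_SAMPLE_RATE = 16_000  # the rate this model expects, per the model card


def wav_sample_rate(path: str) -> int:
    """Return the sample rate in Hz stored in a PCM WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path: str, target_rate: int = TARGET_SAMPLE_RATE) -> bool:
    """Return True when the file's sample rate differs from the target rate."""
    return wav_sample_rate(path) != target_rate
```

When a file does need resampling, loading it with `librosa.load(path, sr=16_000)`, as the inference script in this document does, resamples it during loading.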

The script used for training can be found here: https://github.com/jonatasgrosman/wav2vec2-sprint

✨ Features

Datasets: Uses common_voice and mozilla-foundation/common_voice_6_0 for training and evaluation.
Metrics: Evaluated using Word Error Rate (WER) and Character Error Rate (CER).
Task: Focuses on the Automatic Speech Recognition task.
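For readers unfamiliar with the two metrics, here is a minimal sketch of how WER and CER are commonly defined: the Levenshtein (edit) distance between reference and hypothesis, divided by the reference length, counted in words for WER and in characters for CER. This is an illustrative implementation, not the evaluation code used for this model. The function names are hypothetical, and no text normalization is applied.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,         # deletion of a reference item
                current[j - 1] + 1,      # insertion of a hypothesis item
                previous[j - 1] + cost,  # substitution or exact match
            ))
        previous = current
    return previous[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

For example, with the reference `ALL'S WELL THAT ENDS WELL` and the hypothesis `ALL AS WELL THAT ENDS WELL`, the word-level distance is 2 (one substitution, one insertion) over 5 reference words, so `wer` returns 0.4. The character-level distance is 2 over 25 characters, so `cer` returns 0.08.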

📦 Installation

The original README does not give an installation command. The usage examples rely on huggingsound, torch, librosa, datasets, and transformers, which can be installed with pip:

pip install huggingsound torch librosa datasets transformers

💻 Usage Examples

Basic Usage

Using the HuggingSound library:

from huggingsound import SpeechRecognitionModel

model = SpeechRecognitionModel("jonatasgrosman/wav2vec2-large-xlsr-53-english")
audio_paths = ["/path/to/file.mp3", "/path/to/another_file.wav"]

transcriptions = model.transcribe(audio_paths)

Advanced Usage

Writing your own inference script:

import torch
import librosa
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "en"
MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-english"
SAMPLES = 10

test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = batch["sentence"].upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentences = processor.batch_decode(predicted_ids)

for i, predicted_sentence in enumerate(predicted_sentences):
    print("-" * 100)
    print("Reference:", test_dataset[i]["sentence"])
    print("Prediction:", predicted_sentence)
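The `torch.argmax` and `processor.batch_decode` steps in the script above amount to greedy CTC decoding: pick the most likely token per frame, collapse consecutive repeats, then drop the blank token. The sketch below illustrates that idea in plain Python on a toy vocabulary. The vocabulary, the IDs, and the function name are hypothetical. It assumes the padding token acts as the CTC blank and `|` marks word boundaries, which is my understanding of how Wav2Vec2 tokenizers are usually configured.

```python
from itertools import groupby
from typing import Dict, Sequence


def ctc_greedy_decode(
    frame_ids: Sequence[int],
    id_to_token: Dict[int, str],
    blank_id: int,
    word_delimiter: str = "|",
) -> str:
    """Turn per-frame argmax token IDs into text using the greedy CTC rule."""
    # Step 1: collapse runs of the same ID into a single ID.
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # Step 2: drop blanks (they only separate genuine repeated characters).
    tokens = [id_to_token[token_id] for token_id in collapsed if token_id != blank_id]
    # Step 3: join characters and turn word delimiters into spaces.
    return "".join(tokens).replace(word_delimiter, " ").strip()


# Toy vocabulary for illustration only; the real model's vocabulary differs.
TOY_VOCAB = {0: "<pad>", 1: "|", 2: "S", 3: "I", 4: "X", 5: "A", 6: "L"}

print(ctc_greedy_decode([0, 2, 2, 0, 3, 3, 3, 0, 4, 1, 1], TOY_VOCAB, blank_id=0))
print(ctc_greedy_decode([5, 5, 6, 0, 6, 6], TOY_VOCAB, blank_id=0))
```

The second call shows why the blank matters: the blank between the two runs of `L` keeps them from being collapsed into one, so the result is `ALL` rather than `AL`.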

Prediction Results

Reference	Prediction
"SHE'LL BE ALL RIGHT."	SHE'LL BE ALL RIGHT
SIX	SIX
"ALL'S WELL THAT ENDS WELL."	ALL AS WELL THAT ENDS WELL
DO YOU MEAN IT?	DO YOU MEAN IT
THE NEW PATCH IS LESS INVASIVE THAN THE OLD ONE, BUT STILL CAUSES REGRESSIONS.	THE NEW PATCH IS LESS INVASIVE THAN THE OLD ONE BUT STILL CAUSES REGRESSION
HOW IS MOZILLA GOING TO HANDLE AMBIGUITIES LIKE QUEUE AND CUE?	HOW IS MOSLILLAR GOING TO HANDLE ANDBEWOOTH HIS LIKE Q AND Q
"I GUESS YOU MUST THINK I'M KINDA BATTY."	RUSTIAN WASTIN PAN ONTE BATTLY
NO ONE NEAR THE REMOTE MACHINE YOU COULD RING?	NO ONE NEAR THE REMOTE MACHINE YOU COULD RING
SAUCE FOR THE GOOSE IS SAUCE FOR THE GANDER.	SAUCE FOR THE GUICE IS SAUCE FOR THE GONDER
GROVES STARTED WRITING SONGS WHEN SHE WAS FOUR YEARS OLD.	GRAFS STARTED WRITING SONGS WHEN SHE WAS FOUR YEARS OLD
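In the table above, the references keep quotation marks and punctuation while the predictions contain only letters, apostrophes, and spaces. Comparing the two columns fairly therefore requires normalizing the references first. The following is a minimal sketch of one such normalization. The function name and the exact character set are assumptions for illustration and are not taken from the author's evaluation script.

```python
import re

# Anything other than uppercase letters, apostrophes, and spaces is treated as
# punctuation. This character set is an assumption for this sketch.
_NON_TRANSCRIPT_CHARS = re.compile(r"[^A-Z' ]")


def normalize_reference(text: str) -> str:
    """Uppercase the text, strip punctuation, and collapse repeated spaces."""
    text = text.upper()
    text = _NON_TRANSCRIPT_CHARS.sub(" ", text)
    return " ".join(text.split())


print(normalize_reference('"SHE\'LL BE ALL RIGHT."'))
print(normalize_reference("DO YOU MEAN IT?"))
```

After this step, the first and fourth rows of the table match their predictions exactly, so they no longer count as errors caused by punctuation alone.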

📚 Documentation

Evaluation

To evaluate on mozilla-foundation/common_voice_6_0 with the test split:

python eval.py --model_id jonatasgrosman/wav2vec2-large-xlsr-53-english --dataset mozilla-foundation/common_voice_6_0 --config en --split test

To evaluate on speech-recognition-community-v2/dev_data:

python eval.py --model_id jonatasgrosman/wav2vec2-large-xlsr-53-english --dataset speech-recognition-community-v2/dev_data --config en --split validation --chunk_length_s 5.0 --stride_length_s 1.0
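The `--chunk_length_s 5.0` and `--stride_length_s 1.0` flags suggest that long recordings are split into overlapping windows before transcription. How eval.py implements this is not shown in this document, so the sketch below only illustrates one common windowing scheme. It assumes the stride is applied on both sides of each chunk, so consecutive windows advance by the chunk length minus twice the stride. The function name `chunk_windows` is hypothetical.

```python
from typing import List, Tuple


def chunk_windows(
    num_samples: int,
    sample_rate: int = 16_000,
    chunk_length_s: float = 5.0,
    stride_length_s: float = 1.0,
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of overlapping windows over the audio."""
    chunk_len = int(round(chunk_length_s * sample_rate))
    stride = int(round(stride_length_s * sample_rate))
    # Assumption: the stride overlaps on the left and the right of every chunk.
    step = chunk_len - 2 * stride
    if step <= 0:
        raise ValueError("chunk length must exceed twice the stride length")

    windows = []
    start = 0
    while True:
        end = min(start + chunk_len, num_samples)
        windows.append((start, end))
        if end >= num_samples:  # the last window reaches the end of the audio
            break
        start += step
    return windows


# A 12-second recording at 16kHz, printed in seconds for readability.
for start, end in chunk_windows(12 * 16_000):
    print(f"{start / 16_000:.0f}s to {end / 16_000:.0f}s")
```

With these settings, a 12-second recording yields four windows: 0s to 5s, 3s to 8s, 6s to 11s, and 9s to 12s. The overlapping regions give the model context at each boundary, and the duplicated predictions there are discarded when the chunk outputs are merged.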

📄 License

This model is licensed under the Apache-2.0 license.

📚 Model Index

Property	Details
Model Name	XLSR Wav2Vec2 English by Jonatas Grosman
Task	Automatic Speech Recognition
Datasets	Common Voice en, Robust Speech Event - Dev Data
Metrics	Test WER, Test CER, Test WER (+LM), Test CER (+LM), Dev WER, Dev CER, Dev WER (+LM), Dev CER (+LM)
Results	See the following table for detailed metric values

Results

Task	Dataset	Metric	Value
Automatic Speech Recognition	Common Voice en	Test WER	19.06
Automatic Speech Recognition	Common Voice en	Test CER	7.69
Automatic Speech Recognition	Common Voice en	Test WER (+LM)	14.81
Automatic Speech Recognition	Common Voice en	Test CER (+LM)	6.84
Automatic Speech Recognition	Robust Speech Event - Dev Data	Dev WER	27.72
Automatic Speech Recognition	Robust Speech Event - Dev Data	Dev CER	11.65
Automatic Speech Recognition	Robust Speech Event - Dev Data	Dev WER (+LM)	20.85
Automatic Speech Recognition	Robust Speech Event - Dev Data	Dev CER (+LM)	11.01
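To put the language model's effect in perspective, the relative error reduction can be computed from the table above as (without LM - with LM) / without LM. The short sketch below applies that formula to the reported values. The function name is hypothetical, and the only inputs are the figures from the table.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Relative reduction in percent when moving from baseline to improved."""
    return (baseline - improved) / baseline * 100.0


# (without LM, with LM) pairs copied from the results table.
RESULTS = {
    "Common Voice en, Test WER": (19.06, 14.81),
    "Common Voice en, Test CER": (7.69, 6.84),
    "Robust Speech Event - Dev Data, Dev WER": (27.72, 20.85),
    "Robust Speech Event - Dev Data, Dev CER": (11.65, 11.01),
}

for name, (without_lm, with_lm) in RESULTS.items():
    print(f"{name}: {relative_reduction(without_lm, with_lm):.1f}% relative reduction")
```

This gives relative reductions of roughly 22.3% and 24.8% in WER on the two datasets, against 11.1% and 5.5% in CER. In these results, the language model helps more at the word level than at the character level.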

📖 Citation

To cite this model, you can use the following entry:

@misc{grosman2021xlsr53-large-english,
  title={Fine-tuned {XLSR}-53 large model for speech recognition in {E}nglish},
  author={Grosman, Jonatas},
  howpublished={\url{https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-english}},
  year={2021}
}
