wav2vec2-xls-r-1b-english Open-source English Speech Recognition Model

Developed by jonatasgrosman

This is an English speech recognition model based on the XLS-R 1B architecture, fine-tuned on multiple English speech datasets.

Speech Recognition

Transformers

English | Open Source License: Apache-2.0 | #English speech recognition #High-precision WER #Multi-dataset training

Downloads 1,896

Release Time: 3/2/2022

Model Overview

This model is optimized for English speech recognition tasks, capable of converting English speech to text.

Model Features

Multi-dataset training

Trained using multiple datasets including Common Voice 8.0, Multilingual LibriSpeech, TED-LIUMv3, and Voxpopuli

High performance

Achieves 21.05% WER and 8.44% CER on the Common Voice 8 test set

Language model support

Can be used in conjunction with a language model (LM) to further improve recognition accuracy

Model Capabilities

English speech recognition

Real-time speech-to-text

Supports 16kHz sampling rate audio processing

Use Cases

Speech transcription

Meeting minutes

Automatically convert English meeting recordings into text transcripts

Roughly 20% word error rate (WER) on the reported benchmarks without a language model, i.e. about one error per five reference words

Podcast transcription

Convert English podcast content into text transcripts

Assistive technology

Voice input system

Provide voice input solutions for people with disabilities

🚀 Fine-tuned XLS-R 1B model for speech recognition in English

This model is a version of facebook/wav2vec2-xls-r-1b fine-tuned for English speech recognition on multiple datasets.

✨ Features

Multi-dataset fine-tuning: Fine-tuned on Common Voice 8.0, Multilingual LibriSpeech, TED-LIUMv3, and Voxpopuli for better generalization.
Multiple evaluation metrics: Evaluated on several datasets using WER (Word Error Rate) and CER (Character Error Rate), both with and without a language model.

Property	Details
Model Type	Fine-tuned XLS-R 1B for English speech recognition
Training Data	mozilla-foundation/common_voice_8_0, Multilingual LibriSpeech, TED-LIUMv3, Voxpopuli

Model Performance

Task	Dataset	WER (%)	CER (%)	WER (+LM, %)	CER (+LM, %)
Automatic Speech Recognition	Common Voice 8 (Test)	21.05	8.44	17.31	7.77
Automatic Speech Recognition	Robust Speech Event - Dev Data	20.53	9.31	17.70	8.93
Automatic Speech Recognition	Robust Speech Event - Test Data	17.88	-	-	-
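
For readers unfamiliar with the metrics in the table, the following is a minimal sketch of how WER and CER are commonly computed: the edit distance between reference and hypothesis, divided by the reference length, at word level and character level respectively. The helper names (`edit_distance`, `wer`, `cer`) and the example sentences are illustrative assumptions, not the evaluation code used for this model.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate in percent (hypothetical helper; reference must be non-empty)."""
    ref_words = reference.split()
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate in percent (hypothetical helper; reference must be non-empty)."""
    return 100.0 * edit_distance(list(reference), list(hypothesis)) / len(reference)


# Made-up example: one substituted word out of six reference words.
reference = "THE CAT SAT ON THE MAT"
hypothesis = "THE CAT SAT ON A MAT"
print(f"WER: {wer(reference, hypothesis):.2f}%")
print(f"CER: {cer(reference, hypothesis):.2f}%")
```

Because insertions count as errors, WER can exceed 100%, so it is not simply the complement of an accuracy figure.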

🚀 Quick Start

When using this model, make sure that your speech input is sampled at 16kHz. The model was fine-tuned with the HuggingSound tool, using GPU credits generously provided by OVHcloud.
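
One way to check the 16kHz requirement before transcription is to read the sampling rate from the file header. The sketch below uses only Python's standard `wave` module, so it works for uncompressed PCM WAV files only; the function names and the `EXPECTED_RATE` constant are illustrative assumptions.

```python
import wave

EXPECTED_RATE = 16_000  # sampling rate the model expects, per the note above


def wav_sample_rate(path):
    """Return the sampling rate (Hz) stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, expected_rate=EXPECTED_RATE):
    """True if the file's sampling rate differs from the expected rate."""
    return wav_sample_rate(path) != expected_rate
```

Files that fail this check need to be resampled before inference. Loading audio with `librosa.load(path, sr=16_000)`, as the inference script in Advanced Usage does, resamples to the requested rate on load.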

💻 Usage Examples

Basic Usage

Using the HuggingSound library:

from huggingsound import SpeechRecognitionModel

model = SpeechRecognitionModel("jonatasgrosman/wav2vec2-xls-r-1b-english")
audio_paths = ["/path/to/file.mp3", "/path/to/another_file.wav"]

transcriptions = model.transcribe(audio_paths)

Advanced Usage

Writing your own inference script:

import torch
import librosa
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "en"
MODEL_ID = "jonatasgrosman/wav2vec2-xls-r-1b-english"
SAMPLES = 10

test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = batch["sentence"].upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
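# Pad the batch to a common length and build the attention mask
inputs = processor(test_dataset["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

# Run inference without tracking gradients
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

# Greedy decoding: take the most likely token at each frame
predicted_ids = torch.argmax(logits, dim=-1)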
predicted_sentences = processor.batch_decode(predicted_ids)
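
The last two lines of the script perform greedy CTC decoding: the most likely token is chosen per frame, and the decoder then merges consecutive repeats and drops blank tokens. The sketch below illustrates that collapse step on a toy example in pure Python. The vocabulary, the `ctc_greedy_collapse` helper, and the choice of id 0 as the blank are assumptions for illustration and do not reflect this model's actual vocabulary.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_ids, id_to_token, blank_id=0, word_delimiter="|"):
    """Collapse per-frame token ids into text (illustrative helper).

    1. Merge consecutive duplicate ids.
    2. Drop the blank id.
    3. Join tokens and turn the word delimiter into a space.
    """
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    tokens = [id_to_token[i] for i in collapsed if i != blank_id]
    return "".join(tokens).replace(word_delimiter, " ").strip()


# Toy vocabulary (hypothetical): id 0 plays the role of the CTC blank.
toy_vocab = {0: "<pad>", 1: "|", 2: "H", 3: "I", 4: "L", 5: "O", 6: "E"}

# Per-frame argmax ids; the blank between the two L runs keeps the double letter.
frames = [2, 2, 0, 6, 4, 4, 0, 4, 5, 5, 1, 1, 2, 3, 3]
print(ctc_greedy_collapse(frames, toy_vocab))  # HELLO HI
```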

📚 Documentation

Evaluation Commands

To evaluate on mozilla-foundation/common_voice_8_0 with split test:

python eval.py --model_id jonatasgrosman/wav2vec2-xls-r-1b-english --dataset mozilla-foundation/common_voice_8_0 --config en --split test

To evaluate on speech-recognition-community-v2/dev_data:

python eval.py --model_id jonatasgrosman/wav2vec2-xls-r-1b-english --dataset speech-recognition-community-v2/dev_data --config en --split validation --chunk_length_s 5.0 --stride_length_s 1.0
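
The `--chunk_length_s` and `--stride_length_s` flags suggest that long recordings are split into overlapping chunks before inference. The sketch below shows one common way such windows are laid out, assuming the stride is applied on both sides of each chunk and the overlapping margins are discarded when the results are merged. The `chunk_windows` function is hypothetical, and the exact behaviour of `eval.py` is not shown in this document.

```python
def chunk_windows(n_samples, chunk_len, stride):
    """Return (start, end, left_margin, right_margin) tuples, all in samples.

    Consecutive chunks overlap by 2 * stride; the margins mark the part of
    each chunk that would be discarded when merging predictions.
    """
    step = chunk_len - 2 * stride
    if step <= 0:
        raise ValueError("chunk_len must be greater than 2 * stride")
    windows = []
    start = 0
    while True:
        end = min(start + chunk_len, n_samples)
        is_last = start + chunk_len >= n_samples
        left = 0 if start == 0 else stride    # no left margin on the first chunk
        right = 0 if is_last else stride      # no right margin on the last chunk
        windows.append((start, end, left, right))
        if is_last:
            break
        start += step
    return windows


SAMPLE_RATE = 16_000
# A 12-second recording with 5.0 s chunks and a 1.0 s stride, as in the command above.
windows = chunk_windows(12 * SAMPLE_RATE, 5 * SAMPLE_RATE, 1 * SAMPLE_RATE)
for start, end, left, right in windows:
    keep_from, keep_to = start + left, end - right
    print(f"chunk {start / SAMPLE_RATE:.0f}-{end / SAMPLE_RATE:.0f} s, "
          f"kept {keep_from / SAMPLE_RATE:.0f}-{keep_to / SAMPLE_RATE:.0f} s")
```

With these settings the kept regions tile the recording without gaps, while each chunk still gives the model up to one second of context on either side.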

📄 License

This project is licensed under the Apache-2.0 license.

📚 Citation

If you want to cite this model you can use this:

@misc{grosman2021xlsr-1b-english,
  title={Fine-tuned {XLS-R} 1{B} model for speech recognition in {E}nglish},
  author={Grosman, Jonatas},
  howpublished={\url{https://huggingface.co/jonatasgrosman/wav2vec2-xls-r-1b-english}},
  year={2022}
}
