XLSR Wav2Vec2 English by Jonatas Grosman
This project presents a fine-tuned XLSR-53 large model for English speech recognition. It addresses the need for accurate automatic speech recognition in English by fine-tuning a pre-trained model on specific datasets.
Quick Start
This model is a fine-tuned version of facebook/wav2vec2-large-xlsr-53 on English, using the train and validation splits of Common Voice 6.1. When using this model, ensure that your speech input is sampled at 16 kHz.
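The 16 kHz requirement above can be checked before transcription. The following is a minimal sketch using only the Python standard library; the function names (`wav_sample_rate`, `needs_resampling`, `write_silent_wav`) are hypothetical helpers for illustration, and the `wave` module only reads uncompressed PCM WAV files, so other formats such as MP3 would need a different reader.

```python
import os
import tempfile
import wave

TARGET_RATE = 16_000  # sampling rate this README asks for


def wav_sample_rate(path):
    """Return the sampling rate stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_rate=TARGET_RATE):
    """True if the file's sampling rate differs from the target rate."""
    return wav_sample_rate(path) != target_rate


def write_silent_wav(path, rate, seconds=0.1):
    """Write a short mono 16-bit silent WAV file (used only for the demo)."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(rate)
        wav_file.writeframes(b"\x00\x00" * int(rate * seconds))


# Demo on two temporary files: one at 16 kHz, one at 8 kHz
with tempfile.TemporaryDirectory() as tmp_dir:
    ok_path = os.path.join(tmp_dir, "ok.wav")
    low_path = os.path.join(tmp_dir, "low.wav")
    write_silent_wav(ok_path, 16_000)
    write_silent_wav(low_path, 8_000)
    ok_needs = needs_resampling(ok_path)
    low_needs = needs_resampling(low_path)

print("16 kHz file needs resampling:", ok_needs)
print("8 kHz file needs resampling:", low_needs)
```

If a file does need resampling, loading it with `librosa.load(path, sr=16_000)`, as the inference script in this README does, resamples it on load.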
This model was fine-tuned thanks to GPU credits generously provided by OVHcloud.
The script used for training can be found here: https://github.com/jonatasgrosman/wav2vec2-sprint
⨠Features
- Datasets: Utilizes
common_voice
and mozilla - foundation/common_voice_6_0
for training and evaluation.
- Metrics: Evaluated using Word Error Rate (WER) and Character Error Rate (CER).
- Task: Focuses on the Automatic Speech Recognition task.
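To make the two metrics above concrete, here is a minimal pure-Python sketch of WER and CER based on Levenshtein edit distance. It is an illustration of the standard definitions, not the evaluation code used for this model, and the function names are hypothetical. The example sentence pair comes from the prediction table in this README, with punctuation removed by hand. The functions return fractions; multiply by 100 to get percentages like those in the results table.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


reference = "SAUCE FOR THE GOOSE IS SAUCE FOR THE GANDER"
prediction = "SAUCE FOR THE GUICE IS SAUCE FOR THE GONDER"
print(f"WER: {wer(reference, prediction):.3f}")
print(f"CER: {cer(reference, prediction):.3f}")
```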
Installation
The original README does not give an installation command. To use the model, install the libraries used in the usage examples: huggingsound, torch, librosa, datasets, and transformers. You can install them with pip:
pip install huggingsound torch librosa datasets transformers
Usage Examples
Basic Usage
Using the HuggingSound library:
from huggingsound import SpeechRecognitionModel
model = SpeechRecognitionModel("jonatasgrosman/wav2vec2-large-xlsr-53-english")
audio_paths = ["/path/to/file.mp3", "/path/to/another_file.wav"]
transcriptions = model.transcribe(audio_paths)
Advanced Usage
Writing your own inference script:
import torch
import librosa
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
LANG_ID = "en"
MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-english"
SAMPLES = 10
test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")
processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Load each audio file, resampling to 16 kHz, and uppercase the reference text
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = batch["sentence"].upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

# Run the model without tracking gradients
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

# Greedy decoding: take the most likely token at every frame
predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentences = processor.batch_decode(predicted_ids)

for i, predicted_sentence in enumerate(predicted_sentences):
    print("-" * 100)
    print("Reference:", test_dataset[i]["sentence"])
    print("Prediction:", predicted_sentence)
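The script above takes the argmax over the logits and hands the resulting ids to `processor.batch_decode`. For a CTC model, decoding conceptually collapses consecutive repeated ids and then drops the blank token. The sketch below illustrates that idea in plain Python; the toy vocabulary, the blank id of 0, and the function names are assumptions made for illustration, and the real vocabulary and ids come from the model's tokenizer.

```python
from itertools import groupby

# Hypothetical toy vocabulary; "<pad>" plays the role of the CTC blank
# and "|" marks word boundaries.
TOY_VOCAB = {0: "<pad>", 1: "|", 2: "S", 3: "I", 4: "X"}
BLANK_ID = 0
WORD_DELIMITER = "|"


def ctc_greedy_collapse(ids, blank_id=BLANK_ID):
    """Collapse consecutive repeats, then remove blank ids."""
    collapsed = [token_id for token_id, _ in groupby(ids)]
    return [token_id for token_id in collapsed if token_id != blank_id]


def ids_to_text(ids, vocab=TOY_VOCAB):
    """Map collapsed ids to characters and turn word delimiters into spaces."""
    chars = [vocab[token_id] for token_id in ctc_greedy_collapse(ids)]
    return "".join(chars).replace(WORD_DELIMITER, " ").strip()


# Frame-level argmax ids for a short utterance (made up for this example)
frame_ids = [0, 2, 2, 0, 3, 3, 3, 4, 0, 0]
decoded = ids_to_text(frame_ids)
print(decoded)
```

A blank between two identical ids keeps them as separate characters, which is how CTC can emit doubled letters.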
Prediction Results
| Reference | Prediction |
| --- | --- |
| "SHE'LL BE ALL RIGHT." | SHE'LL BE ALL RIGHT |
| SIX | SIX |
| "ALL'S WELL THAT ENDS WELL." | ALL AS WELL THAT ENDS WELL |
| DO YOU MEAN IT? | DO YOU MEAN IT |
| THE NEW PATCH IS LESS INVASIVE THAN THE OLD ONE, BUT STILL CAUSES REGRESSIONS. | THE NEW PATCH IS LESS INVASIVE THAN THE OLD ONE BUT STILL CAUSES REGRESSION |
| HOW IS MOZILLA GOING TO HANDLE AMBIGUITIES LIKE QUEUE AND CUE? | HOW IS MOSLILLAR GOING TO HANDLE ANDBEWOOTH HIS LIKE Q AND Q |
| "I GUESS YOU MUST THINK I'M KINDA BATTY." | RUSTIAN WASTIN PAN ONTE BATTLY |
| NO ONE NEAR THE REMOTE MACHINE YOU COULD RING? | NO ONE NEAR THE REMOTE MACHINE YOU COULD RING |
| SAUCE FOR THE GOOSE IS SAUCE FOR THE GANDER. | SAUCE FOR THE GUICE IS SAUCE FOR THE GONDER |
| GROVES STARTED WRITING SONGS WHEN SHE WAS FOUR YEARS OLD. | GRAFS STARTED WRITING SONGS WHEN SHE WAS FOUR YEARS OLD |
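In the table above, references keep their punctuation while predictions contain only letters, apostrophes, and spaces, so a fair comparison needs the references normalized first. The sketch below shows one plausible normalization; the exact rule and the name `normalize_reference` are assumptions and may differ from what the author's evaluation script does.

```python
import re

# Assumption: keep only uppercase letters, apostrophes, and spaces
_NOT_ALLOWED = re.compile(r"[^A-Z' ]+")


def normalize_reference(text):
    """Uppercase, replace disallowed characters with spaces, squeeze whitespace."""
    text = _NOT_ALLOWED.sub(" ", text.upper())
    return " ".join(text.split())


rows = [
    ('"SHE\'LL BE ALL RIGHT."', "SHE'LL BE ALL RIGHT"),
    ("DO YOU MEAN IT?", "DO YOU MEAN IT"),
    ("SAUCE FOR THE GOOSE IS SAUCE FOR THE GANDER.",
     "SAUCE FOR THE GUICE IS SAUCE FOR THE GONDER"),
]
for reference, prediction in rows:
    match = normalize_reference(reference) == prediction
    print(f"{'exact match' if match else 'differs':12} {prediction}")
```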
Documentation
Evaluation
- To evaluate on mozilla-foundation/common_voice_6_0 with split test:
python eval.py --model_id jonatasgrosman/wav2vec2-large-xlsr-53-english --dataset mozilla-foundation/common_voice_6_0 --config en --split test
- To evaluate on speech-recognition-community-v2/dev_data:
python eval.py --model_id jonatasgrosman/wav2vec2-large-xlsr-53-english --dataset speech-recognition-community-v2/dev_data --config en --split validation --chunk_length_s 5.0 --stride_length_s 1.0
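The second command passes `--chunk_length_s 5.0` and `--stride_length_s 1.0`. The contents of eval.py are not shown here, so the following is only a sketch under an assumption: that these flags drive chunked inference in which long audio is cut into overlapping windows, with the stride applied on both sides of each window. The function name `chunk_bounds` is hypothetical.

```python
def chunk_bounds(n_samples, sample_rate=16_000, chunk_s=5.0, stride_s=1.0):
    """Return (start, end) sample indices of overlapping windows.

    Assumption: each window is chunk_s long and shares stride_s of context
    with its neighbour on each side, so windows advance by
    chunk_s - 2 * stride_s.
    """
    chunk = int(chunk_s * sample_rate)
    stride = int(stride_s * sample_rate)
    step = chunk - 2 * stride
    if step <= 0:
        raise ValueError("chunk length must exceed twice the stride length")

    bounds = []
    start = 0
    while True:
        end = min(start + chunk, n_samples)
        bounds.append((start, end))
        if end == n_samples:
            break
        start += step
    return bounds


# A 10-second recording at 16 kHz
for start, end in chunk_bounds(160_000):
    print(f"{start / 16_000:.1f}s - {end / 16_000:.1f}s")
```

Overlapping windows give the model context at chunk edges, where a hard cut could otherwise split a word.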
License
This model is licensed under the apache-2.0 license.
Model Index
| Property | Details |
| --- | --- |
| Model Name | XLSR Wav2Vec2 English by Jonatas Grosman |
| Task | Automatic Speech Recognition |
| Datasets | Common Voice en, Robust Speech Event - Dev Data |
| Metrics | Test WER, Test CER, Test WER (+LM), Test CER (+LM), Dev WER, Dev CER, Dev WER (+LM), Dev CER (+LM) |
| Results | See the following table for detailed metric values |
Results
| Task | Dataset | Metric | Value |
| --- | --- | --- | --- |
| Automatic Speech Recognition | Common Voice en | Test WER | 19.06 |
| Automatic Speech Recognition | Common Voice en | Test CER | 7.69 |
| Automatic Speech Recognition | Common Voice en | Test WER (+LM) | 14.81 |
| Automatic Speech Recognition | Common Voice en | Test CER (+LM) | 6.84 |
| Automatic Speech Recognition | Robust Speech Event - Dev Data | Dev WER | 27.72 |
| Automatic Speech Recognition | Robust Speech Event - Dev Data | Dev CER | 11.65 |
| Automatic Speech Recognition | Robust Speech Event - Dev Data | Dev WER (+LM) | 20.85 |
| Automatic Speech Recognition | Robust Speech Event - Dev Data | Dev CER (+LM) | 11.01 |
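The effect of the language model (+LM) in the table above is easier to compare across datasets as a relative error reduction. The short sketch below computes it from the reported values; the helper name `relative_reduction` is hypothetical and the numbers are copied from the table.

```python
def relative_reduction(baseline, improved):
    """Relative error reduction in percent, going from baseline to improved."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - improved) / baseline


# (without LM, with LM), values taken from the results table
reported = {
    "Test WER": (19.06, 14.81),
    "Test CER": (7.69, 6.84),
    "Dev WER": (27.72, 20.85),
    "Dev CER": (11.65, 11.01),
}

for metric, (without_lm, with_lm) in reported.items():
    gain = relative_reduction(without_lm, with_lm)
    print(f"{metric}: {without_lm} -> {with_lm} ({gain:.1f}% relative reduction)")
```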
Citation
To cite this model, use the following entry:
@misc{grosman2021xlsr53-large-english,
title={Fine-tuned {XLSR}-53 large model for speech recognition in {E}nglish},
author={Grosman, Jonatas},
howpublished={\url{https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-english}},
year={2021}
}