Speech Recognition Model Evaluation
This project evaluates a Wav2Vec2 speech recognition model on the Italian Common Voice dataset. It provides a Python script
that runs the model for Automatic Speech Recognition (ASR) and calculates the Word Error Rate (WER).
Quick Start
Prerequisites
Make sure the required libraries are installed. You can install them with pip. The jiwer package is included because the wer metric depends on it:

    pip install torchaudio datasets transformers torch jiwer
Run the Evaluation Script
The following Python script evaluates the model on the Italian Common Voice test dataset and calculates the WER.

    import re

    import torch
    import torchaudio
    from datasets import load_dataset, load_metric
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    model_name = "facebook/wav2vec2-large-xlsr-53-italian"
    # Use the GPU when one is available; otherwise fall back to the CPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    # Punctuation stripped from the reference transcripts before scoring.
    chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"]'

    model = Wav2Vec2ForCTC.from_pretrained(model_name).to(device)
    processor = Wav2Vec2Processor.from_pretrained(model_name)

    ds = load_dataset("common_voice", "it", split="test", data_dir="./cv-corpus-6.1-2020-12-11")

    # The script assumes 48 kHz source audio and resamples it to 16 kHz.
    resampler = torchaudio.transforms.Resample(orig_freq=48_000, new_freq=16_000)


    def map_to_array(batch):
        """Load and resample one clip, and normalize its reference transcript."""
        speech, _ = torchaudio.load(batch["path"])
        batch["speech"] = resampler(speech.squeeze(0)).numpy()
        batch["sampling_rate"] = resampler.new_freq
        batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower().replace("’", "'")
        return batch


    ds = ds.map(map_to_array)


    def map_to_pred(batch):
        """Transcribe a batch of clips with greedy (argmax) decoding."""
        features = processor(batch["speech"], sampling_rate=batch["sampling_rate"][0], padding=True, return_tensors="pt")
        input_values = features.input_values.to(device)
        attention_mask = features.attention_mask.to(device)
        with torch.no_grad():
            logits = model(input_values, attention_mask=attention_mask).logits
        pred_ids = torch.argmax(logits, dim=-1)
        batch["predicted"] = processor.batch_decode(pred_ids)
        batch["target"] = batch["sentence"]
        return batch


    result = ds.map(map_to_pred, batched=True, batch_size=16, remove_columns=list(ds.features.keys()))

    wer = load_metric("wer")
    print(wer.compute(predictions=result["predicted"], references=result["target"]))
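
The transcript normalization step in the script can be tried in isolation. The sketch below is illustrative only: the helper name normalize_sentence and the sample sentence are made up for this example, and it assumes the character being replaced is the typographic apostrophe (U+2019).

```python
import re

# Same punctuation set as chars_to_ignore_regex in the evaluation script.
CHARS_TO_IGNORE = r'[\,\?\.\!\-\;\:\"]'


def normalize_sentence(sentence: str) -> str:
    """Strip ignored punctuation, lowercase, and unify apostrophes."""
    cleaned = re.sub(CHARS_TO_IGNORE, "", sentence)
    # Assumption: the script maps the typographic apostrophe to a plain one.
    return cleaned.lower().replace("\u2019", "'")


print(normalize_sentence("Ciao, come stai? L\u2019acqua è fredda!"))
# ciao come stai l'acqua è fredda
```

Applying the same normalization to the references keeps punctuation and casing differences from being counted as word errors.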
Result
The model achieves a Word Error Rate (WER) of 22.1% on the Italian Common Voice test dataset.
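
For readers unfamiliar with the metric, WER is the number of word substitutions, deletions, and insertions needed to turn the prediction into the reference, divided by the number of reference words. The following is a minimal pure-Python sketch for a single sentence pair; it is not the implementation behind the wer metric used in the script, and the function name word_error_rate is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One substituted word out of five reference words.
print(word_error_rate("il gatto dorme sul divano", "il gatto dorme sul letto"))  # 0.2
```

Because insertions count as errors, WER can exceed 1.0 when the prediction is much longer than the reference.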
Features
- Speech Recognition: Utilizes the Wav2Vec2 architecture for automatic speech recognition.
- Dataset Integration: Works with the Italian Common Voice dataset.
- Evaluation Metric: Calculates the Word Error Rate (WER) to measure the performance of the model.
Installation
To install the required libraries, including jiwer (needed by the wer metric), run the following command:

    pip install torchaudio datasets transformers torch jiwer
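
To confirm the installation before starting a long evaluation run, the packages can be looked up without importing them. This is a small sketch using only the standard library; the helper name missing_packages is hypothetical.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level package names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


required = ["torchaudio", "datasets", "transformers", "torch"]
print("Missing packages:", missing_packages(required) or "none")
```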
Usage Examples
Basic Usage
The provided Python script is a complete example of evaluating the model on the Italian Common Voice test dataset. You can run it directly after installing the necessary libraries.
Advanced Usage
You can modify the script to use different models or datasets. For example, you can change the model_name variable to use a different pre-trained model, or change the load_dataset parameters to use a different dataset.

    # Placeholder names: substitute a real model identifier and dataset.
    model_name = "another_model_name"
    model = Wav2Vec2ForCTC.from_pretrained(model_name).to(device)
    processor = Wav2Vec2Processor.from_pretrained(model_name)
    ds = load_dataset("another_dataset", "another_language", split="test", data_dir="./another_dataset_dir")
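
One way to avoid editing the script for every model or dataset is to read these values from the command line. The sketch below uses argparse from the standard library; all flag names are hypothetical and not part of the original script, and the defaults simply mirror the values used in the evaluation script.

```python
import argparse


def parse_args(argv=None):
    """Parse evaluation settings; every flag name here is hypothetical."""
    parser = argparse.ArgumentParser(description="Evaluate a Wav2Vec2 model with WER")
    parser.add_argument("--model-name", default="facebook/wav2vec2-large-xlsr-53-italian")
    parser.add_argument("--dataset", default="common_voice")
    parser.add_argument("--language", default="it")
    parser.add_argument("--data-dir", default="./cv-corpus-6.1-2020-12-11")
    parser.add_argument("--batch-size", type=int, default=16)
    return parser.parse_args(argv)


# Passing an explicit list makes the example reproducible without a shell.
args = parse_args(["--model-name", "another_model_name", "--batch-size", "8"])
print(args.model_name, args.language, args.batch_size)
# another_model_name it 8
```

The parsed values could then be passed to from_pretrained and load_dataset in place of the hard-coded strings.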
Documentation
- Wav2Vec2ForCTC: A Wav2Vec2 model with a Connectionist Temporal Classification (CTC) head, used here for speech recognition.
- Wav2Vec2Processor: A processor that preprocesses audio data and decodes model outputs into text.
- load_dataset: A function from the datasets library that loads a dataset.
- load_metric: A function from the datasets library that loads an evaluation metric. Newer datasets releases have deprecated this function in favor of the separate evaluate library.
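
The script takes the argmax over the logits and passes the resulting ids to the processor for decoding. The core idea of greedy CTC decoding can be sketched as follows: collapse consecutive repeated ids, then drop the blank token. The toy vocabulary and the function name ctc_greedy_decode are assumptions for illustration; the real processor also handles word delimiters and other special tokens.

```python
from itertools import groupby


def ctc_greedy_decode(pred_ids, id_to_char, blank_id=0):
    """Collapse repeated ids, drop blanks, and map the rest to characters."""
    collapsed = [token_id for token_id, _ in groupby(pred_ids)]
    return "".join(id_to_char[t] for t in collapsed if t != blank_id)


# Toy vocabulary (hypothetical); a real model ships its own vocabulary.
vocab = {0: "<pad>", 1: "c", 2: "i", 3: "a", 4: "o"}
frame_ids = [1, 1, 0, 2, 2, 2, 0, 3, 0, 0, 4, 4]
print(ctc_greedy_decode(frame_ids, vocab))  # ciao
```

A blank between two identical ids keeps both characters, which is how CTC represents doubled letters.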
License
This project is licensed under the Apache 2.0 license.