🚀 Wav2Vec2-Large-XLSR-53-Spanish
This model is a fine-tuned version of facebook/wav2vec2-large-xlsr-53 on Spanish, using the Common Voice dataset. It is designed for automatic speech recognition (ASR) tasks.
🚀 Quick Start
When using this model, make sure that your speech input is sampled at 16kHz.
✨ Features
- Language Adaptation: Fine-tuned specifically for the Spanish language using the Common Voice dataset.
- Efficient Processing: Can be used directly without a language model.
📦 Installation
No specific installation steps are required beyond the Python packages imported in the usage examples: `torch`, `torchaudio`, `transformers`, `datasets`, `librosa`, and `jiwer`.
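A minimal environment can be set up with pip (the package list is inferred from the imports in the usage and evaluation examples below; exact versions are not pinned here):

```shell
pip install torch torchaudio transformers datasets librosa jiwer
```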
💻 Usage Examples
Basic Usage
```python
import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "es", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("pcuenq/wav2vec2-large-xlsr-53-es")
model = Wav2Vec2ForCTC.from_pretrained("pcuenq/wav2vec2-large-xlsr-53-es")

# Common Voice clips are 48 kHz; the model expects 16 kHz.
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to read the audio files as arrays.
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)
print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
```
📚 Documentation
Evaluation
The model can be evaluated as follows on the Spanish test data of Common Voice.
```python
import re

import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "es", split="test")
wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("pcuenq/wav2vec2-large-xlsr-53-es")
model = Wav2Vec2ForCTC.from_pretrained("pcuenq/wav2vec2-large-xlsr-53-es")
model.to("cuda")

## Text pre-processing

chars_to_ignore_regex = '[\,\¿\?\.\¡\!\-\;\:\"\“\%\‘\”\\…\’\ː\'\‹\›\`\´\®\—\→]'
chars_to_ignore_pattern = re.compile(chars_to_ignore_regex)

def remove_special_characters(batch):
    batch["sentence"] = chars_to_ignore_pattern.sub('', batch["sentence"]).lower() + " "
    return batch

def replace_diacritics(batch):
    sentence = batch["sentence"]
    sentence = re.sub('ì', 'í', sentence)
    sentence = re.sub('ù', 'ú', sentence)
    sentence = re.sub('ò', 'ó', sentence)
    sentence = re.sub('à', 'á', sentence)
    batch["sentence"] = sentence
    return batch

def replace_additional(batch):
    sentence = batch["sentence"]
    sentence = re.sub('ã', 'a', sentence)   # Portuguese, as in São Paulo
    sentence = re.sub('ō', 'o', sentence)   # Japanese
    sentence = re.sub('ê', 'e', sentence)   # Português
    batch["sentence"] = sentence
    return batch

## Audio pre-processing
# I tried to perform the resampling using a `torchaudio` `Resample` transform,
# but found that the process deadlocked when using multiple processes.
# Perhaps my torchaudio is using the wrong sox library under the hood, I'm not sure.
# Fortunately, `librosa` seems to work fine, so that's what I'll use for now.
import librosa

def speech_file_to_array_fn(batch):
    speech_array, sample_rate = torchaudio.load(batch["path"])
    batch["speech"] = librosa.resample(speech_array.squeeze().numpy(), orig_sr=sample_rate, target_sr=16_000)
    return batch

# One-pass mapping function: text transformation and audio resampling.
def cv_prepare(batch):
    batch = remove_special_characters(batch)
    batch = replace_diacritics(batch)
    batch = replace_additional(batch)
    batch = speech_file_to_array_fn(batch)
    return batch

# Number of CPUs, or None
num_proc = 16
test_dataset = test_dataset.map(cv_prepare, remove_columns=['path'], num_proc=num_proc)

def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

# WER metric computation.
# `wer.compute` crashes on my computer with more than ~10000 samples.
# Until I confirm in a different one, I created a "chunked" version of the computation.
# It gives the same results as `wer.compute` for smaller datasets.
import jiwer

def chunked_wer(targets, predictions, chunk_size=None):
    if chunk_size is None:
        return jiwer.wer(targets, predictions)
    start = 0
    end = chunk_size
    H, S, D, I = 0, 0, 0, 0
    while start < len(targets):
        chunk_metrics = jiwer.compute_measures(targets[start:end], predictions[start:end])
        H = H + chunk_metrics["hits"]
        S = S + chunk_metrics["substitutions"]
        D = D + chunk_metrics["deletions"]
        I = I + chunk_metrics["insertions"]
        start += chunk_size
        end += chunk_size
    # WER = (S + D + I) / N, where N = H + S + D is the number of reference words.
    return float(S + D + I) / float(H + S + D)

print("WER: {:.2f}".format(100 * chunked_wer(result["sentence"], result["pred_strings"], chunk_size=4000)))
# print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
```
Test Result: 10.50 %
Text processing
The Common Voice `es` dataset contains many characters that don't belong to the Spanish language, even after discarding separators and punctuation. A few characters were translated to their Spanish equivalents, and most of the extraneous ones were discarded.
The decision was made to keep all the Spanish-language diacritics. This is a difficult choice. Sometimes diacritics are added just because of orthography rules and don't alter the meaning of the word. In other cases, however, they carry meaning, as they disambiguate among different senses (for example, *si* "if" vs. *sí* "yes"). A better WER score would surely have been achieved using just the non-accented characters, and the resulting text would still be understood by Spanish speakers. Nevertheless, keeping them was considered "more correct".
All the rules applied are shown in the evaluation script.
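As a condensed, self-contained illustration of these rules (a sketch using only a subset of the punctuation list; the full set is in the evaluation script above):

```python
import re

# A subset of the punctuation stripped in the evaluation script.
chars_to_ignore_pattern = re.compile(r'[\,\¿\?\.\¡\!\-\;\:\"\%]')

def normalize(sentence: str) -> str:
    # Strip punctuation and lowercase; Spanish diacritics (á, é, í, ó, ú, ü, ñ) are kept.
    sentence = chars_to_ignore_pattern.sub('', sentence).lower()
    # Map grave accents (not used in Spanish) to the acute forms.
    for src, dst in [('ì', 'í'), ('ù', 'ú'), ('ò', 'ó'), ('à', 'á')]:
        sentence = sentence.replace(src, dst)
    return sentence

print(normalize("¿Qué pasò?"))  # -> qué pasó
```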
Training
The Common Voice `train` and `validation` datasets were used for training.
For dataset handling reasons, the `train` + `validation` datasets were initially split into 10% splits so that progress could be monitored earlier and adjustments could be made if needed.
- Trained for 30 epochs on the first split only, using similar values as those proposed by Patrick in his demo notebook. A batch_size of 24 with 2 gradient accumulation steps was used. This gave a WER of about 16.3% on the full test set.
- Then trained the resulting model on the 9 remaining splits, for 3 epochs each, but with a faster warmup of 75 steps.
- Next, trained 3 epochs on each of the 10 splits, using a smaller learning rate of `1e-4`. A warmup of 75 steps was used in this case too. The final model had a WER of about 11.7%.
- By this time, the reason for the initial delay in training time was identified, and the full dataset was used for training. A cosine schedule with hard restarts, a reference learning rate of `3e-5` and 10 epochs were selected. The cosine schedule was configured to have 10 cycles, and no warmup was used. This produced a WER of ~10.5%.
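The final schedule described above can be reproduced with the `transformers` scheduler helpers. A sketch only: the optimizer, parameter set, and step count below are placeholders, not the values from the actual run.

```python
import torch
from transformers import get_cosine_with_hard_restarts_schedule_with_warmup

# Placeholder parameters; the real run optimized the wav2vec2 model weights.
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.AdamW(params, lr=3e-5)  # reference learning rate from the text

num_training_steps = 1_000  # placeholder; depends on dataset size and batch size
scheduler = get_cosine_with_hard_restarts_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,                  # no warmup was used
    num_training_steps=num_training_steps,
    num_cycles=10,                       # 10 cosine cycles with hard restarts
)

print(scheduler.get_last_lr())  # starts at the reference rate, 3e-5
```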
Other things I tried
- Starting from the same fine-tuned model, a constant lr of `1e-4` was compared against a linear schedule with warmup. The linear schedule worked better (11.85 vs 12.72 WER%).
- Attempted to use a Spanish model to improve a Basque one. The text was transformed to make orthography more similar to the target language, but the Basque model did not improve.
- Label smoothing did not work.
Issues and other technical challenges
Previously, the `transformers` library had been used only as an end-user tool, to try BERT on some tasks. This was the first time its code was examined in depth.
- The `datasets` abstraction is great: being based on memory-mapped files, it allows arbitrarily sized datasets to be processed. However, it's important to understand its limitations and trade-offs. Caching is convenient, but disk usage can increase rapidly. The datasets for current projects are stored on a 1 TB, fast SSD disk, and there were a couple of times when the disk ran out of space. It is necessary to understand how cache files are stored, and when to disable caching and save manually. Data exploration is better suited for smaller or sampled datasets, but actual processing is most efficient when the required transformations are identified and applied in a single `map` operation.
- There was a noticeable delay before training started. Fortunately, the reason was found, discussed on Slack and in the forums, and a workaround was created.
- The WER metric crashed on large datasets. Evaluation was done on a small sample (which is also faster), and an accumulative version of `wer` that runs in fixed memory was written. It would be good to verify whether this change makes sense for use inside the training loop.
- `torchaudio` deadlocks when using multiple processes. `librosa` works fine; this issue needs further investigation.
- When using `num_proc` inside a notebook, progress bars could not be seen. This is likely a permissions issue on the computer and needs to be resolved.
🔧 Technical Details
The model is based on fine-tuning facebook/wav2vec2-large-xlsr-53 on the Spanish language using the Common Voice dataset. Different training strategies were explored to optimize the Word Error Rate (WER).
📄 License
This project is licensed under the Apache 2.0 license.
| Property | Details |
|---|---|
| Model Type | Fine-tuned Wav2Vec2-Large-XLSR-53 for Spanish |
| Training Data | Common Voice `train` and `validation` datasets |
⚠️ Important Note
When using this model, make sure that your speech input is sampled at 16kHz.
💡 Usage Tip
For evaluation, if the `wer.compute` function crashes on large datasets, use the `chunked_wer` function provided in the evaluation script.

