🚀 Wav2Vec2-XLSR-300m-es
This model is a fine-tuned version of facebook/wav2vec2-xls-r-300m on the Spanish Common Voice dataset, trained with GPU credits generously provided by OVHcloud for the Speech Recognition challenge. It achieves strong results on Spanish speech recognition, reaching a test WER of 14.6 on Common Voice 8.0.
✨ Features
- Fine-tuned on Spanish data: trained on the Spanish Common Voice dataset, making it well suited to Spanish speech recognition.
- n-gram support: can be used with the 5-gram language model bundled in the processor, improving recognition accuracy.
- Strong evaluation results: achieves low WER (Word Error Rate) on multiple evaluation datasets.
📦 Installation
No specific installation steps are provided in the original document; the usage examples below assume the standard Hugging Face audio stack.
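A minimal sketch of the assumed setup: pyctcdecode and kenlm are needed because the processor bundles a 5-gram language model, and the exact library versions used in training are listed under Framework Versions below.

```bash
$ pip install transformers datasets torch torchaudio
$ pip install pyctcdecode https://github.com/kpu/kenlm/archive/master.zip
```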
💻 Usage Examples
Basic Usage
The model can be used with the 5-gram language model bundled in the processor as follows:
```python
import re

import torch
from datasets import Audio, load_dataset
from transformers import AutoModelForCTC, Wav2Vec2ProcessorWithLM

# Load the model and the processor (which bundles the 5-gram LM decoder)
processor = Wav2Vec2ProcessorWithLM.from_pretrained("polodealvarado/xls-r-300m-es")
model = AutoModelForCTC.from_pretrained("polodealvarado/xls-r-300m-es")

# Keep only lowercase letters (including accented vowels and ñ) and spaces
chars_to_ignore_regex = "[^a-záéíóúñ ]"

def remove_extra_chars(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, "", batch["sentence"].lower())
    return batch

def prepare_dataset(batch):
    audio = batch["audio"]
    batch["input_values"] = processor(
        audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt", padding=True
    ).input_values[0]
    with processor.as_target_processor():
        batch["labels"] = processor(batch["sentence"]).input_ids
    return batch

common_voice_test = load_dataset(
    "mozilla-foundation/common_voice_8_0", "es", split="test", use_auth_token=True
)
common_voice_test = common_voice_test.remove_columns(
    ["accent", "age", "client_id", "down_votes", "gender", "locale", "segment", "up_votes"]
)
common_voice_test = common_voice_test.cast_column("audio", Audio(sampling_rate=16_000))
common_voice_test = common_voice_test.map(remove_extra_chars)
common_voice_test = common_voice_test.map(prepare_dataset)

# Run inference on the first test utterance; the LM decoder consumes the raw logits
inputs = torch.tensor(common_voice_test[0]["input_values"]).unsqueeze(0)
with torch.no_grad():
    logits = model(inputs).logits

text = processor.batch_decode(logits.numpy()).text
print(text)
```
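The evaluation results below also report a figure without the language model. For comparison, here is a minimal sketch of plain greedy (argmax) CTC decoding, reusing the `logits` computed above and bypassing the 5-gram LM via the processor's underlying tokenizer:

```python
# Greedy CTC decoding without the 5-gram LM (assumes `logits` from the example above)
pred_ids = torch.argmax(logits, dim=-1)
text_without_lm = processor.tokenizer.batch_decode(pred_ids)
print(text_without_lm)
```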
Advanced Usage
Alternatively, you can run the eval.py script to evaluate the model on a dataset:

```bash
$ python eval.py --model_id polodealvarado/xls-r-300m-es --dataset mozilla-foundation/common_voice_8_0 --config es --device 0 --split test
```
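eval.py reports WER over the chosen split. To sanity-check WER on a few utterances yourself, a minimal sketch using the datasets 1.x metric API (matching the Datasets version listed under Framework Versions; the prediction/reference strings are placeholders):

```python
from datasets import load_metric  # in newer stacks this metric lives in the `evaluate` package

wer_metric = load_metric("wer")
wer = wer_metric.compute(
    predictions=["hola como estas"],  # placeholder model transcriptions
    references=["hola cómo estás"],   # placeholder ground-truth sentences
)
print(f"WER: {wer:.3f}")
```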
📚 Documentation
Evaluation Results
It achieves the following results on the evaluation set:
Without LM:
- Loss: 0.1900
- WER: 0.146

With 5-gram:
Model Index
| Task | Dataset | Metrics |
|------|---------|---------|
| Speech Recognition | mozilla-foundation/common_voice_8_0 es | Test WER: 14.6 |
| Automatic Speech Recognition | Robust Speech Event - Dev Data | Test WER: 28.63 |
| Automatic Speech Recognition | Robust Speech Event - Test Data | Test WER: 29.72 |
Training Hyperparameters
The following hyperparameters were used during training:
| Property | Details |
|----------|---------|
| learning_rate | 0.0003 |
| train_batch_size | 16 |
| eval_batch_size | 8 |
| seed | 42 |
| gradient_accumulation_steps | 2 |
| total_train_batch_size | 32 |
| optimizer | Adam with betas=(0.9, 0.999) and epsilon=1e-08 |
| lr_scheduler_type | linear |
| lr_scheduler_warmup_steps | 500 |
| num_epochs | 4 |
| mixed_precision_training | Native AMP |
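For reference, a sketch of how this table maps onto transformers.TrainingArguments; the output_dir value and the surrounding Trainer wiring are assumptions, and the Adam betas/epsilon are the defaults listed above:

```python
from transformers import TrainingArguments

# Hypothetical reconstruction of the configuration in the table above
training_args = TrainingArguments(
    output_dir="./xls-r-300m-es",   # placeholder, not from the original card
    learning_rate=3e-4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=8,
    seed=42,
    gradient_accumulation_steps=2,  # effective train batch size: 32
    lr_scheduler_type="linear",
    warmup_steps=500,
    num_train_epochs=4,
    fp16=True,                      # Native AMP mixed-precision training
)
```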
Training Results
| Training Loss | Epoch | Step | Validation Loss | WER |
|---------------|-------|------|-----------------|-----|
| 3.6747 | 0.3 | 400 | 0.6535 | 0.5926 |
| 0.4439 | 0.6 | 800 | 0.3753 | 0.3193 |
| 0.3291 | 0.9 | 1200 | 0.3267 | 0.2721 |
| 0.2644 | 1.2 | 1600 | 0.2816 | 0.2311 |
| 0.24 | 1.5 | 2000 | 0.2647 | 0.2179 |
| 0.2265 | 1.79 | 2400 | 0.2406 | 0.2048 |
| 0.1994 | 2.09 | 2800 | 0.2357 | 0.1869 |
| 0.1613 | 2.39 | 3200 | 0.2242 | 0.1821 |
| 0.1546 | 2.69 | 3600 | 0.2123 | 0.1707 |
| 0.1441 | 2.99 | 4000 | 0.2067 | 0.1619 |
| 0.1138 | 3.29 | 4400 | 0.2044 | 0.1519 |
| 0.1072 | 3.59 | 4800 | 0.1917 | 0.1457 |
| 0.0992 | 3.89 | 5200 | 0.1900 | 0.1438 |
Framework Versions
| Property | Details |
|----------|---------|
| Transformers | 4.16.0.dev0 |
| Pytorch | 1.10.1+cu102 |
| Datasets | 1.17.1.dev0 |
| Tokenizers | 0.11.0 |
📄 License
The model is released under the Apache-2.0 license.