wav2vec2-xls-r-1b-spanish: Open-Source Model for Spanish Automatic Speech Recognition

Developed by jonatasgrosman

This is a Spanish automatic speech recognition model fine-tuned from the 1-billion-parameter XLS-R model and trained on multiple Spanish datasets.

Task: Speech Recognition
Library: Transformers
Language: Spanish
Open Source License: Apache-2.0
Tags: Spanish Speech Recognition, 1 Billion Parameter Model, Multi-dataset Training
Downloads: 2,270
Release Time: 3/2/2022

Model Overview

This model is optimized for Spanish speech recognition. It expects audio sampled at 16kHz and reaches a WER of 9.97% on the Common Voice 8 test set (6.74% with a language model).

Model Features

Large-scale Pretraining

Fine-tuned from the 1-billion-parameter XLS-R model (facebook/wav2vec2-xls-r-1b), whose large-scale pretraining provides strong speech feature extraction.

Multi-dataset Training

Trained on multiple Spanish datasets including Common Voice 8.0, MediaSpeech, and Multilingual TEDx.

High Performance

Achieves a WER of 6.74% on the Common Voice 8 test set (with language model).

Language Model Support

Supports integration with language models to further improve recognition accuracy.

Model Capabilities

Spanish Speech Recognition

16kHz Audio Processing

Batch Speech Transcription

Use Cases

Speech-to-Text

Speech Transcription Service

Convert Spanish speech content into text.

Achieves a WER of 6.74% on the Common Voice 8 test set when combined with a language model.

Voice Assistants

Spanish Voice Assistant

Provides voice interaction capabilities for Spanish-speaking users.

🚀 Fine-tuned XLS-R 1B model for speech recognition in Spanish

This is a fine-tuned model for Spanish speech recognition. It is based on facebook/wav2vec2-xls-r-1b and fine-tuned using the train and validation splits of multiple datasets, including Common Voice 8.0, MediaSpeech, Multilingual TEDx, Multilingual LibriSpeech, and Voxpopuli. When using this model, ensure that your speech input is sampled at 16kHz.
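The 16kHz requirement can be checked before transcription. The following is a minimal sketch, assuming uncompressed PCM WAV input, which is the only format the standard-library `wave` module reads. The helper names `wav_sample_rate` and `needs_resampling` are hypothetical and are not part of the model or of any library mentioned here.

```python
import wave

EXPECTED_RATE = 16_000  # sampling rate the model card asks for, in Hz


def wav_sample_rate(path):
    """Return the sampling rate (Hz) stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, expected_rate=EXPECTED_RATE):
    """True if the file's sampling rate differs from the expected rate."""
    return wav_sample_rate(path) != expected_rate
```

Files for which `needs_resampling` returns True would need to be resampled first, for instance by loading them with `librosa.load(path, sr=16_000)` as the inference script in this document does.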

✨ Features

Language Support: Specifically fine-tuned for Spanish speech recognition.
Multiple Datasets: Trained on a variety of datasets for better generalization.
Performance Metrics: Reports WER (Word Error Rate) and CER (Character Error Rate) on multiple evaluation datasets, for example 9.97% WER and 2.85% CER on the Common Voice 8 test set without a language model.

📦 Installation

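The original model card gives no installation steps. The usage examples in this document import huggingsound, torch, librosa, datasets, and transformers, so those Python packages need to be installed before running them.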

💻 Usage Examples

Basic Usage

Using the HuggingSound library:

from huggingsound import SpeechRecognitionModel

model = SpeechRecognitionModel("jonatasgrosman/wav2vec2-xls-r-1b-spanish")
audio_paths = ["/path/to/file.mp3", "/path/to/another_file.wav"]

transcriptions = model.transcribe(audio_paths)
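The HuggingSound README shows `transcribe` returning one dictionary per input file, with the recognized text under a `"transcription"` key. Assuming that shape holds for the installed version, the sketch below pairs each path with its text. The helper name `pair_paths_with_text` is hypothetical, and the sample results are made up for illustration, not real model output.

```python
def pair_paths_with_text(audio_paths, transcriptions):
    """Map each audio path to its transcribed text.

    Assumes each result is a dict with a "transcription" key.
    """
    if len(audio_paths) != len(transcriptions):
        raise ValueError("expected one transcription per audio path")
    return {
        path: result["transcription"]
        for path, result in zip(audio_paths, transcriptions)
    }


# Made-up results standing in for model.transcribe(audio_paths).
example_paths = ["/path/to/file.mp3", "/path/to/another_file.wav"]
example_results = [
    {"transcription": "hola mundo"},
    {"transcription": "buenos días"},
]
paired = pair_paths_with_text(example_paths, example_results)
```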

Advanced Usage

Writing your own inference script:

import torch
import librosa
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "es"
MODEL_ID = "jonatasgrosman/wav2vec2-xls-r-1b-spanish"
SAMPLES = 10

test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

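# Preprocess the dataset: read each audio file into a float array,
# resampled to the 16 kHz rate the model expects.
def speech_file_to_array_fn(batch):
    speech_array, _ = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = batch["sentence"].upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

# Run the model without tracking gradients (inference only).
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

# Greedy decoding: pick the most likely token for each frame, then let the
# processor turn the token ids into text.
predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentences = processor.batch_decode(predicted_ids)

# Print each reference sentence next to the model's prediction.
for reference, prediction in zip(test_dataset["sentence"], predicted_sentences):
    print("Reference:", reference)
    print("Prediction:", prediction)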
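The script ends with an argmax over the logits followed by `batch_decode`. For a CTC model, greedy decoding conventionally means merging consecutive repeated token ids and then dropping the blank token. The sketch below illustrates that rule on plain integer lists. The function name `ctc_greedy_collapse` is hypothetical, and `blank_id=0` is an assumption for the example only: the real blank id depends on the model's vocabulary.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Collapse per-frame token ids the way greedy CTC decoding does.

    1. Merge runs of identical consecutive ids into one id.
    2. Remove the blank id (assumed to be 0 here).
    """
    merged = [token_id for token_id, _ in groupby(frame_ids)]
    return [token_id for token_id in merged if token_id != blank_id]


# A blank between two identical ids keeps them as separate tokens.
collapsed = ctc_greedy_collapse([0, 3, 3, 0, 3, 4, 4, 0])
```

In this example the two runs of `3` are separated by a blank, so both survive, while the repeated `4` is merged into one token.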

📚 Documentation

Evaluation Commands

To evaluate on mozilla-foundation/common_voice_8_0 with the test split:

python eval.py --model_id jonatasgrosman/wav2vec2-xls-r-1b-spanish --dataset mozilla-foundation/common_voice_8_0 --config es --split test

To evaluate on speech-recognition-community-v2/dev_data with the validation split:

python eval.py --model_id jonatasgrosman/wav2vec2-xls-r-1b-spanish --dataset speech-recognition-community-v2/dev_data --config es --split validation --chunk_length_s 5.0 --stride_length_s 1.0
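The second command passes `--chunk_length_s 5.0` and `--stride_length_s 1.0`, which suggests that long recordings are split into overlapping windows before transcription. The sketch below illustrates one common way to lay out such windows. It is not taken from `eval.py`: the function name `chunk_windows` is hypothetical, and it assumes the stride is applied on both sides of a chunk, so consecutive chunks start `chunk_length_s - 2 * stride_length_s` seconds apart.

```python
def chunk_windows(total_s, chunk_length_s=5.0, stride_length_s=1.0):
    """Return (start, end) times in seconds of overlapping chunks.

    Assumes a stride on each side of a chunk, so the step between
    chunk starts is chunk_length_s - 2 * stride_length_s.
    """
    step = chunk_length_s - 2 * stride_length_s
    if step <= 0:
        raise ValueError("chunk length must exceed twice the stride")
    windows = []
    start = 0.0
    while True:
        end = min(start + chunk_length_s, total_s)
        windows.append((start, end))
        if end >= total_s:
            break
        start += step
    return windows


# Layout for a 12-second recording with the values from the command.
windows = chunk_windows(12.0)
```

With these values each chunk shares two seconds of audio with its neighbor, which gives the model context on both sides of every chunk boundary.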

Model Information

Property	Details
Model Type	Fine-tuned XLS-R 1B model for Spanish speech recognition
Training Data	Common Voice 8.0, MediaSpeech, Multilingual TEDx, Multilingual LibriSpeech, Voxpopuli

Evaluation Results (WER and CER in %)

Dataset	WER	CER	WER (+LM)	CER (+LM)
Common Voice 8 (Test Data)	9.97	2.85	6.74	2.24
Robust Speech Event - Dev Data	24.79	9.7	16.37	8.84
Robust Speech Event - Test Data	16.67	-	-	-
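WER and CER are both edit-distance ratios: the number of word (or character) insertions, deletions, and substitutions needed to turn the prediction into the reference, divided by the reference length. The sketch below shows that definition for a single sentence pair. It is an illustration only, not the evaluation script behind the reported numbers, which may normalize text and aggregate over a whole dataset differently. The function names are hypothetical.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (lists or strings)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def wer(reference, hypothesis):
    """Word error rate of one hypothesis against one reference."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate, counting spaces as characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# One wrong word out of four; one missing character out of nineteen.
example_wer = wer("hola como estas hoy", "hola como esta hoy")
example_cer = cer("hola como estas hoy", "hola como esta hoy")
```

Multiplying these ratios by 100 gives percentages on the same scale as the figures reported for this model.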

📄 License

This model is licensed under the Apache-2.0 license.

📚 Citation

If you want to cite this model you can use this:

@misc{grosman2021xlsr-1b-spanish,
  title={Fine-tuned {XLS-R} 1{B} model for speech recognition in {S}panish},
  author={Grosman, Jonatas},
  howpublished={\url{https://huggingface.co/jonatasgrosman/wav2vec2-xls-r-1b-spanish}},
  year={2022}
}
