Open-source model wav2vec2-large-xlsr-53-spanish-ep5-944h - Automatic speech recognition for Spanish

Wav2vec2 Large Xlsr 53 Spanish Ep5 944h

Developed by carlosdanielhernandezmena

An acoustic model for Spanish automatic speech recognition, obtained by fine-tuning facebook/wav2vec2-large-xlsr-53 for 5 epochs on approximately 944 hours of Spanish data.

Speech Recognition

Transformers

#Spanish #Spanish speech recognition #Multi-dialect support #High-precision WER

Downloads 111

Release Time : 12/1/2022

Model Overview

This model targets Spanish speech recognition. It was fine-tuned on about 944 hours of Spanish speech and has been evaluated on test sets that include broadcast news and telephone conversations.

Model Features

Multi-dataset training

Trained using approximately 944 hours of Spanish data from the CIEMPIESS-UNAM project and other public repositories

Low WER

Reaches a WER of 9.20% on the Mozilla Common Voice 10.0 test set and 7.48% on the HUB4-NE broadcast news test set. Conversational telephone speech is harder, with a WER of 39.12% on the CALLHOME test set.

Dialect coverage

Training data includes various Spanish dialects, such as those from Mexico, Chile, Colombia, Peru, Argentina, and Puerto Rico

Model Capabilities

Spanish speech recognition

Multi-dialect recognition

High-precision transcription

Use Cases

Speech transcription

Broadcast news transcription

Used for transcribing Spanish broadcast news content

WER of 7.48% on the HUB4NE test set

Telephone speech transcription

Used for transcribing telephone conversation content

WER of 39.12% on the CALLHOME test set

Voice assistants

Spanish voice command recognition

Used for command recognition in Spanish voice assistants

🚀 wav2vec2-large-xlsr-53-spanish-ep5-944h

The "wav2vec2-large-xlsr-53-spanish-ep5-944h" is an acoustic model designed for Automatic Speech Recognition in Spanish. It is fine-tuned from "facebook/wav2vec2-large-xlsr-53" with about 944 hours of Spanish data.

🚀 Quick Start

The model is the result of fine-tuning "facebook/wav2vec2-large-xlsr-53" for 5 epochs on around 944 hours of Spanish data. The CIEMPIESS-UNAM Project has gathered or developed this data since 2012. Most of it can be found at the CIEMPIESS-UNAM Project homepage, http://www.ciempiess.org/, and the rest is available in public repositories such as LDC or OpenSLR.

✨ Features

Suitable for Automatic Speech Recognition in Spanish.
Fine-tuned on about 944 hours of Spanish data from the CIEMPIESS-UNAM Project and from public repositories such as LDC and OpenSLR.

📦 Installation

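The original model card lists no installation steps. The usage example below needs the Python packages it imports, which include torch, transformers, datasets, and numpy.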

💻 Usage Examples

Basic Usage

import numpy as np
import torch
import evaluate
from datasets import load_dataset, Audio
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# Load the processor and model.
MODEL_NAME = "carlosdanielhernandezmena/wav2vec2-large-xlsr-53-spanish-ep5-944h"
processor = Wav2Vec2Processor.from_pretrained(MODEL_NAME)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_NAME)

# Load the dataset.
ds = load_dataset("ciempiess/ciempiess_test", split="test")

# Resample the audio to 16 kHz.
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

# Process the dataset.
def prepare_dataset(batch):
    audio = batch["audio"]
    # Batched output is "un-batched" to ensure mapping is correct.
    batch["input_values"] = processor(audio["array"], sampling_rate=audio["sampling_rate"]).input_values[0]
    # Tokenize the reference transcript directly with the tokenizer
    # (as_target_processor() is deprecated in recent transformers releases).
    batch["labels"] = processor.tokenizer(batch["normalized_text"]).input_ids
    return batch
ds = ds.map(prepare_dataset, remove_columns=ds.column_names, num_proc=1)

# Define the evaluation metric.
# load_metric was removed from recent datasets releases; the evaluate package replaces it.
wer_metric = evaluate.load("wer")

# Not called below; it has the shape of a Trainer compute_metrics callback.
def compute_metrics(pred):
    pred_logits = pred.predictions
    pred_ids = np.argmax(pred_logits, axis=-1)
    pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id
    pred_str = processor.batch_decode(pred_ids)
    # We do not want to group tokens when computing the metrics.
    label_str = processor.batch_decode(pred.label_ids, group_tokens=False)
    wer = wer_metric.compute(predictions=pred_str, references=label_str)
    return {"wer": wer}

# Do the evaluation (with batch_size=1), on the GPU when one is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
def map_to_result(batch):
    with torch.no_grad():
        input_values = torch.tensor(batch["input_values"], device=device).unsqueeze(0)
        logits = model(input_values).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_str"] = processor.batch_decode(pred_ids)[0]
    batch["sentence"] = processor.decode(batch["labels"], group_tokens=False)
    return batch
results = ds.map(map_to_result, remove_columns=ds.column_names)

# Compute the overall WER now.
print("Test WER: {:.3f}".format(wer_metric.compute(predictions=results["pred_str"], references=results["sentence"])))

Test Result: 0.112 (WER on the CIEMPIESS-TEST test split, i.e. about 11.2%)
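The example above turns logits into text with an argmax per audio frame followed by `processor.batch_decode`. The sketch below illustrates what that greedy CTC decoding step does in principle, using plain Python. The toy vocabulary, the blank id, and the function name `ctc_greedy_collapse` are assumptions made for illustration; the real tokenizer has its own vocabulary and handles this internally.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_ids, blank_id):
    """Turn per-frame argmax ids into a label sequence (greedy CTC decoding)."""
    # Step 1: merge each run of identical consecutive ids into a single id.
    merged = [token_id for token_id, _run in groupby(frame_ids)]
    # Step 2: drop the blank id, which only marks "no new character here".
    return [token_id for token_id in merged if token_id != blank_id]


# Hypothetical toy vocabulary for illustration; the real tokenizer has its own.
TOY_VOCAB = {0: "<blank>", 1: "l", 2: "a", 3: "m"}

# One id per audio frame, as an argmax over the logits would produce.
frame_ids = [1, 1, 0, 1, 2, 2, 0, 3, 2]
label_ids = ctc_greedy_collapse(frame_ids, blank_id=0)
decoded = "".join(TOY_VOCAB[i] for i in label_ids)
print(decoded)  # prints "llama"
```

The blank between the two runs of `l` is what preserves the double letter. Reference labels contain no blanks, so grouping repeated ids would wrongly merge genuine double letters, which is why the evaluation decodes the labels with `group_tokens=False`.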

📚 Documentation

Fine-Tuning Data

The specific list of corpora used to fine-tune the model is:

CIEMPIESS-LIGHT (18h25m)
CIEMPIESS-BALANCE (18h20m)
CIEMPIESS-FEM (13h54m)
CHM150 (1h38m)
TEDX_SPANISH (24h29m)
LIBRIVOX_SPANISH (73h01m)
WIKIPEDIA_SPANISH (25h37m)
VOXFORGE_SPANISH (49h42m)
MOZILLA COMMON VOICE 10.0 (320h22m)
HEROICO (16h33m)
LATINO-40 (6h48m)
CALLHOME_SPANISH (13h22m)
HUB4NE_SPANISH (31h41m)
FISHER_SPANISH (127h22m)
Chilean Spanish speech data set (7h08m)
Colombian Spanish speech data set (7h34m)
Peruvian Spanish speech data set (9h13m)
Argentinian Spanish speech data set (8h01m)
Puerto Rico Spanish speech data set (1h00m)
MediaSpeech Spanish (10h00m)
DIMEX100-LIGHT (6h09m)
DIMEX100-NIÑOS (08h09m)
GOLEM-UNIVERSUM (00h10m)
GLISSANDO (6h40m)
TELE_con_CIENCIA (28h16m) Unpublished material
UNSHAREABLE MATERIAL (118h22m) Not available for sharing
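As a sanity check on the corpus list above, the per-corpus durations can be totalled and compared with the quoted figure of roughly 944 hours. The sketch below copies the durations from the list; the helper name `to_minutes` is an assumption made for illustration, and the source does not say how the 944-hour figure was derived.

```python
import re

# (corpus, duration) pairs copied from the list above.
CORPORA = [
    ("CIEMPIESS-LIGHT", "18h25m"), ("CIEMPIESS-BALANCE", "18h20m"),
    ("CIEMPIESS-FEM", "13h54m"), ("CHM150", "1h38m"),
    ("TEDX_SPANISH", "24h29m"), ("LIBRIVOX_SPANISH", "73h01m"),
    ("WIKIPEDIA_SPANISH", "25h37m"), ("VOXFORGE_SPANISH", "49h42m"),
    ("MOZILLA COMMON VOICE 10.0", "320h22m"), ("HEROICO", "16h33m"),
    ("LATINO-40", "6h48m"), ("CALLHOME_SPANISH", "13h22m"),
    ("HUB4NE_SPANISH", "31h41m"), ("FISHER_SPANISH", "127h22m"),
    ("Chilean Spanish speech data set", "7h08m"),
    ("Colombian Spanish speech data set", "7h34m"),
    ("Peruvian Spanish speech data set", "9h13m"),
    ("Argentinian Spanish speech data set", "8h01m"),
    ("Puerto Rico Spanish speech data set", "1h00m"),
    ("MediaSpeech Spanish", "10h00m"), ("DIMEX100-LIGHT", "6h09m"),
    ("DIMEX100-NIÑOS", "08h09m"), ("GOLEM-UNIVERSUM", "00h10m"),
    ("GLISSANDO", "6h40m"), ("TELE_con_CIENCIA", "28h16m"),
    ("UNSHAREABLE MATERIAL", "118h22m"),
]


def to_minutes(duration):
    """Convert a string such as '18h25m' into a number of minutes."""
    match = re.fullmatch(r"(\d+)h(\d+)m", duration)
    if match is None:
        raise ValueError(f"unrecognised duration: {duration!r}")
    hours, minutes = (int(part) for part in match.groups())
    return hours * 60 + minutes


total_minutes = sum(to_minutes(duration) for _name, duration in CORPORA)
total_text = f"{total_minutes // 60}h{total_minutes % 60:02d}m"
print(f"{len(CORPORA)} corpora, {total_text} in total")
```

As listed, the entries total 951h56m, a little above the rounded figure of about 944 hours. The source does not explain the difference.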

Evaluation Results

Task	Dataset	Split	WER (%)
Automatic Speech Recognition	Mozilla Common Voice 10.0	Test	9.20
Automatic Speech Recognition	Mozilla Common Voice 10.0	Validation	8.02
Automatic Speech Recognition	CIEMPIESS-TEST	Test	11.17
Automatic Speech Recognition	1997 Spanish Broadcast News Speech (HUB4-NE)	Test	7.48
Automatic Speech Recognition	CALLHOME Spanish Speech	Test	39.12
Automatic Speech Recognition	CALLHOME Spanish Speech	Validation	40.39
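The WER values in the table are word error rates: the number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. The sketch below is a minimal pure-Python illustration of that definition for a single sentence pair. The function name `word_error_rate` and the example sentences are assumptions made for illustration, and this is not the implementation used to produce the reported numbers.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so
    # far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref_words)


# Hypothetical sentence pair: one deleted word ("en") and one substitution.
wer = word_error_rate("el gato duerme en la cama", "el gato duerme la casa")
print(f"WER: {100 * wer:.2f}%")
```

Here two errors against six reference words give a WER of about 33%. Over a whole test set, the rate is normally computed from the total error count and the total number of reference words, not by averaging per-sentence rates.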

🔧 Technical Details

The fine-tuning process was carried out in November 2022 on the servers of the Language and Voice Lab (https://lvl.ru.is/) at Reykjavík University (Iceland) by Carlos Daniel Hernández Mena.

📄 License

The model is licensed under cc-by-4.0.

📖 BibTeX entry and citation info

When publishing results based on these models please refer to:

@misc{mena2022xlrs53spanish,
      title={Acoustic Model in Spanish: wav2vec2-large-xlsr-53-spanish-ep5-944h.}, 
      author={Hernandez Mena, Carlos Daniel},
      url={https://huggingface.co/carlosdanielhernandezmena/wav2vec2-large-xlsr-53-spanish-ep5-944h},
      year={2022}
}

🙏 Acknowledgements

The author thanks the social service program "Desarrollo de Tecnologías del Habla" at the Facultad de Ingeniería (FI) of the Universidad Nacional Autónoma de México (UNAM). Also, thanks to the social service students for their hard work.

Special thanks to Jón Guðnason, the head of the Language and Voice Lab, for providing computational power. The author also thanks the "Language Technology Programme for Icelandic 2019-2023", managed and coordinated by Almannarómur and funded by the Icelandic Ministry of Education, Science and Culture.
