🚀 wav2vec2-large-xlsr-53-spanish-ep5-944h
"wav2vec2-large-xlsr-53-spanish-ep5-944h" is an acoustic model for Automatic Speech Recognition in Spanish. It is fine-tuned from "facebook/wav2vec2-large-xlsr-53" on about 944 hours of Spanish data.
🚀 Quick Start
The model is the result of fine-tuning "facebook/wav2vec2-large-xlsr-53" for 5 epochs on around 944 hours of Spanish data, gathered or developed by the CIEMPIESS-UNAM Project since 2012. Most of the data is available from the CIEMPIESS-UNAM Project homepage (http://www.ciempiess.org/); the rest is available in public repositories such as LDC or OpenSLR.
✨ Features
- Suitable for Automatic Speech Recognition in Spanish.
- Fine-tuned on about 944 hours of Spanish data from several sources.
📦 Installation
The original model card gives no installation steps. The usage example below imports torch, transformers, datasets, and numpy, so those packages need to be available.
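A quick way to confirm that the packages imported by the usage example are available is sketched below. The package list is taken from the example's import statements; the helper name `missing_packages` is hypothetical, and no particular versions are implied.

```python
from importlib.util import find_spec

# Packages imported by the usage example in this card (assumed list, no versions implied).
REQUIRED = ("torch", "transformers", "datasets", "numpy")


def missing_packages(names=REQUIRED):
    """Return the top-level package names that cannot be located in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All packages used by the example are importable.")
```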
💻 Usage Examples
Basic Usage
import numpy as np
import torch
from datasets import load_dataset, load_metric, Audio
from transformers import Wav2Vec2Processor
from transformers import Wav2Vec2ForCTC

# Load the processor and the fine-tuned model
MODEL_NAME = "carlosdanielhernandezmena/wav2vec2-large-xlsr-53-spanish-ep5-944h"
processor = Wav2Vec2Processor.from_pretrained(MODEL_NAME)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_NAME)

# Load the test set and resample the audio to 16 kHz
ds = load_dataset("ciempiess/ciempiess_test", split="test")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

def prepare_dataset(batch):
    audio = batch["audio"]
    # Convert the waveform to model input values
    batch["input_values"] = processor(audio["array"], sampling_rate=audio["sampling_rate"]).input_values[0]
    # Encode the reference transcript as label IDs
    with processor.as_target_processor():
        batch["labels"] = processor(batch["normalized_text"]).input_ids
    return batch

ds = ds.map(prepare_dataset, remove_columns=ds.column_names, num_proc=1)

# Note: newer releases of datasets deprecate load_metric in favor of the separate evaluate package
wer_metric = load_metric("wer")

# Computes WER from a prediction object with predictions and label_ids (not called in this script)
def compute_metrics(pred):
    pred_logits = pred.predictions
    pred_ids = np.argmax(pred_logits, axis=-1)
    # Replace the ignore index (-100) with the padding token before decoding
    pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id
    pred_str = processor.batch_decode(pred_ids)
    # Labels are not CTC output, so repeated tokens must not be grouped
    label_str = processor.batch_decode(pred.label_ids, group_tokens=False)
    wer = wer_metric.compute(predictions=pred_str, references=label_str)
    return {"wer": wer}

# Run inference on the GPU (requires a CUDA-capable device)
model = model.to(torch.device("cuda"))

def map_to_result(batch):
    with torch.no_grad():
        input_values = torch.tensor(batch["input_values"], device="cuda").unsqueeze(0)
        logits = model(input_values).logits
    # Greedy decoding: most likely token per frame
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_str"] = processor.batch_decode(pred_ids)[0]
    batch["sentence"] = processor.decode(batch["labels"], group_tokens=False)
    return batch

results = ds.map(map_to_result, remove_columns=ds.column_names)
print("Test WER: {:.3f}".format(wer_metric.compute(predictions=results["pred_str"], references=results["sentence"])))
Test Result: 0.112
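The example above takes the argmax over the logits and passes the resulting IDs to `processor.batch_decode`, while the reference labels are decoded with `group_tokens=False`. The difference comes from CTC decoding: frame-level predictions are collapsed (consecutive repeated IDs merged, then the blank ID dropped), whereas labels already hold one ID per character and must not be merged. The following is a minimal pure-Python sketch of that greedy collapse. The toy vocabulary, the blank ID, and the function names `ctc_greedy_collapse` and `ids_to_text` are made up for illustration; this is not the tokenizer's actual implementation.

```python
from itertools import groupby

# Toy vocabulary for illustration only; the real model ships its own vocabulary.
BLANK_ID = 0
ID_TO_CHAR = {1: "h", 2: "o", 3: "l", 4: "a"}


def ctc_greedy_collapse(frame_ids, blank_id=BLANK_ID, group_tokens=True):
    """Optionally merge consecutive duplicate IDs, then drop blank IDs."""
    if group_tokens:
        frame_ids = [token for token, _ in groupby(frame_ids)]
    return [token for token in frame_ids if token != blank_id]


def ids_to_text(ids):
    """Map token IDs to characters using the toy vocabulary."""
    return "".join(ID_TO_CHAR[i] for i in ids)


# Frame-level predictions: repeats are merged and blanks removed.
frames = [1, 1, 0, 2, 2, 2, 0, 3, 0, 4, 4]
print(ids_to_text(ctc_greedy_collapse(frames)))  # hola

# Label IDs with a genuine double letter: grouping would wrongly merge it.
labels = [1, 2, 3, 3, 4]
print(ids_to_text(ctc_greedy_collapse(labels, group_tokens=False)))  # holla
print(ids_to_text(ctc_greedy_collapse(labels, group_tokens=True)))   # hola
```

The last two lines show why grouping is switched off for references: with grouping enabled, a real double letter in the transcript would be lost and the WER would be inflated.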
📚 Documentation
Fine - Tuning Data
The specific list of corpora used to fine-tune the model is:
Evaluation Results
| Task | Dataset | Split | WER (%) |
|---|---|---|---|
| Automatic Speech Recognition | Mozilla Common Voice 10.0 | Test | 9.20 |
| Automatic Speech Recognition | Mozilla Common Voice 10.0 | Validation | 8.02 |
| Automatic Speech Recognition | CIEMPIESS-TEST | Test | 11.17 |
| Automatic Speech Recognition | 1997 Spanish Broadcast News Speech (HUB4-NE) | Test | 7.48 |
| Automatic Speech Recognition | CALLHOME Spanish Speech | Test | 39.12 |
| Automatic Speech Recognition | CALLHOME Spanish Speech | Validation | 40.39 |
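WER (word error rate) is the word-level edit distance between the hypothesis and the reference, divided by the number of reference words. The values listed above appear to be percentages, since the CIEMPIESS-TEST figure of 11.17 matches the 0.112 printed by the usage example. The sketch below is a plain-Python illustration of the standard definition, with a hypothetical function name `word_error_rate` and made-up sentences; it is not the code behind the `wer` metric used in the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] = edit distance between the reference words seen so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1      # reference word missing from the hypothesis
            insertion = current[j - 1] + 1  # extra word in the hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Made-up example: one substitution in a four-word reference.
wer = word_error_rate("el gato come pescado", "el gato come pescada")
print(f"WER: {wer:.2f} ({wer * 100:.0f}%)")  # WER: 0.25 (25%)
```

Because insertions count as errors while the denominator is the reference length, WER can exceed 100% on very noisy output.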
🔧 Technical Details
The fine-tuning process was carried out in November 2022 on the servers of the Language and Voice Lab (https://lvl.ru.is/) at Reykjavík University (Iceland) by Carlos Daniel Hernández Mena.
📄 License
The model is licensed under cc-by-4.0.
📖 BibTeX entry and citation info
When publishing results based on this model, please cite:
@misc{mena2022xlrs53spanish,
  title={Acoustic Model in Spanish: wav2vec2-large-xlsr-53-spanish-ep5-944h.},
  author={Hernandez Mena, Carlos Daniel},
  url={https://huggingface.co/carlosdanielhernandezmena/wav2vec2-large-xlsr-53-spanish-ep5-944h},
  year={2022}
}
🙏 Acknowledgements
The author thanks the social service program "Desarrollo de Tecnologías del Habla" at the Facultad de Ingeniería (FI) of the Universidad Nacional Autónoma de México (UNAM), and the social service students for their hard work.
Special thanks to Jón Guðnason, head of the Language and Voice Lab, for providing computational power. The author also thanks the "Language Technology Programme for Icelandic 2019-2023", which is managed and coordinated by Almannarómur and funded by the Icelandic Ministry of Education, Science and Culture.