🚀 Fine-tuned XLSR-53 large model for speech recognition in Portuguese
This model is a fine-tuned version of facebook/wav2vec2-large-xlsr-53 on Portuguese, using the train and validation splits of Common Voice 6.1. It provides high-quality speech recognition for Portuguese.
📦 Information Table
| Property | Details |
|---|---|
| Model Type | Fine-tuned XLSR-53 large model for Portuguese speech recognition |
| Training Data | Common Voice 6.1 (train and validation splits), mozilla-foundation/common_voice_6_0 |
| Metrics | WER (Word Error Rate), CER (Character Error Rate) |
| Tags | audio, automatic-speech-recognition, hf-asr-leaderboard, mozilla-foundation/common_voice_6_0, pt, robust-speech-event, speech, xlsr-fine-tuning-week |
🚀 Quick Start
This fine-tuned model is based on facebook/wav2vec2-large-xlsr-53 and was trained on Portuguese using the train and validation splits of Common Voice 6.1. When using this model, make sure that your speech input is sampled at 16 kHz.
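Because the model expects 16 kHz input, it can help to check a file's sample rate before transcribing it. The following is a minimal sketch using only Python's standard `wave` module; the helper names `wav_sample_rate` and `needs_resampling` are hypothetical and not part of this model or of HuggingSound. It only reads uncompressed PCM WAV headers, so MP3 and other formats need a different reader.

```python
import wave

TARGET_RATE = 16_000  # sample rate (Hz) the model expects


def wav_sample_rate(path):
    """Return the sample rate (Hz) stored in a PCM WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_rate=TARGET_RATE):
    """Return True when the file's sample rate differs from the target rate."""
    return wav_sample_rate(path) != target_rate
```

If a file does need resampling, loading it with `librosa.load(path, sr=16_000)`, as the inference script in this card does, resamples it while reading.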
This model was fine-tuned thanks to the GPU credits generously provided by OVHcloud :)
The script used for training can be found here: https://github.com/jonatasgrosman/wav2vec2-sprint
💻 Usage Examples
Basic Usage
Using the HuggingSound library:
from huggingsound import SpeechRecognitionModel
model = SpeechRecognitionModel("jonatasgrosman/wav2vec2-large-xlsr-53-portuguese")
audio_paths = ["/path/to/file.mp3", "/path/to/another_file.wav"]
transcriptions = model.transcribe(audio_paths)
Advanced Usage
Writing your own inference script:
import torch
import librosa
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "pt"
MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-portuguese"
SAMPLES = 10

# Load the first SAMPLES examples of the Common Voice test split
test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Read each audio file as a 16 kHz array and uppercase the reference sentence
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = batch["sentence"].upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

# Run inference without tracking gradients
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

# Pick the most likely token per frame and decode the ids to text
predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentences = processor.batch_decode(predicted_ids)

for i, predicted_sentence in enumerate(predicted_sentences):
    print("-" * 100)
    print("Reference:", test_dataset[i]["sentence"])
    print("Prediction:", predicted_sentence)
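The `torch.argmax` and `processor.batch_decode` steps in the script above amount to greedy CTC decoding: take the most likely token for each audio frame, collapse consecutive repeats, drop the blank token, and turn word delimiters into spaces. The toy sketch below illustrates the idea in plain Python. The vocabulary and ids are hypothetical and chosen only for illustration; the real vocabulary ships with the model's processor.

```python
from itertools import groupby

# Hypothetical toy vocabulary; the real one comes from the model's processor.
VOCAB = {0: "<pad>", 1: "|", 2: "O", 3: "I", 4: "T"}
BLANK_ID = 0          # CTC blank (Wav2Vec2 tokenizers use the pad token)
WORD_DELIMITER = "|"  # token that stands for a space between words


def greedy_ctc_decode(frame_ids):
    """Decode a sequence of per-frame token ids into text."""
    # 1. Collapse runs of identical consecutive ids into a single id.
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # 2. Drop blanks, which only separate genuine repeated characters.
    chars = [VOCAB[token_id] for token_id in collapsed if token_id != BLANK_ID]
    # 3. Replace word delimiters with spaces.
    return "".join(chars).replace(WORD_DELIMITER, " ").strip()


print(greedy_ctc_decode([2, 2, 0, 3, 3, 4, 0, 0, 2]))  # OITO
```

Note that a blank between two identical ids keeps both characters (`[4, 0, 4]` decodes to `TT`), while directly repeated ids merge into one (`[4, 4]` decodes to `T`).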
Here is a comparison table of reference and prediction:
| Reference | Prediction |
|---|---|
| NEM O RADAR NEM OS OUTROS INSTRUMENTOS DETECTARAM O BOMBARDEIRO STEALTH. | NEMHUM VADAN OS OLTWES INSTRUMENTOS DE TTÉÃN UM BOMBERDEIRO OSTER |
| PEDIR DINHEIRO EMPRESTADO ÀS PESSOAS DA ALDEIA | E DIR ENGINHEIRO EMPRESTAR AS PESSOAS DA ALDEIA |
| OITO | OITO |
| TRANCÁ-LOS | TRANCAUVOS |
| REALIZAR UMA INVESTIGAÇÃO PARA RESOLVER O PROBLEMA | REALIZAR UMA INVESTIGAÇÃO PARA RESOLVER O PROBLEMA |
| O YOUTUBE AINDA É A MELHOR PLATAFORMA DE VÍDEOS. | YOUTUBE AINDA É A MELHOR PLATAFOMA DE VÍDEOS |
| MENINA E MENINO BEIJANDO NAS SOMBRAS | MENINA E MENINO BEIJANDO NAS SOMBRAS |
| EU SOU O SENHOR | EU SOU O SENHOR |
| DUAS MULHERES QUE SENTAM-SE PARA BAIXO LENDO JORNAIS. | DUAS MIERES QUE SENTAM-SE PARA BAICLANE JODNÓI |
| EU ORIGINALMENTE ESPERAVA | EU ORIGINALMENTE ESPERAVA |
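The two metrics this card lists, WER and CER, can be computed for reference/prediction pairs like the ones above. The sketch below is a plain-Python illustration based on Levenshtein edit distance; the function names are hypothetical and this is not the evaluation script used for this model. It applies no text normalization, so punctuation in the reference counts as a mismatch, and it assumes a non-empty reference.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def wer(reference, prediction):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, prediction.split()) / len(ref_words)


def cer(reference, prediction):
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(reference, prediction) / len(reference)


print(wer("EU SOU O SENHOR", "EU SOU O SENHOR"))  # 0.0
print(cer("PLATAFORMA", "PLATAFOMA"))             # 0.1
```

For the YouTube pair above, this unnormalized WER is 3/9: one deleted word ("O"), one misspelled word, and one mismatch caused only by the trailing period.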
🔧 Evaluation
- To evaluate on mozilla-foundation/common_voice_6_0 with split test:
python eval.py --model_id jonatasgrosman/wav2vec2-large-xlsr-53-portuguese --dataset mozilla-foundation/common_voice_6_0 --config pt --split test
- To evaluate on speech-recognition-community-v2/dev_data:
python eval.py --model_id jonatasgrosman/wav2vec2-large-xlsr-53-portuguese --dataset speech-recognition-community-v2/dev_data --config pt --split validation --chunk_length_s 5.0 --stride_length_s 1.0
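The second command passes `--chunk_length_s 5.0 --stride_length_s 1.0`. The `eval.py` script is not shown here, so the following is only a sketch of what such settings usually mean, assuming they are forwarded to a chunked inference routine like the one in the transformers speech recognition pipeline: long audio is cut into overlapping windows, and each window keeps some extra context on both sides. The `chunk_bounds` helper is hypothetical and simplified.

```python
def chunk_bounds(total_s, chunk_length_s=5.0, stride_length_s=1.0):
    """Return (start, end) times, in seconds, of overlapping windows."""
    # Each window keeps stride_length_s of context on both sides, so
    # consecutive windows advance by the chunk length minus two strides.
    step = chunk_length_s - 2 * stride_length_s
    if step <= 0:
        raise ValueError("chunk_length_s must exceed twice stride_length_s")
    bounds = []
    start = 0.0
    while True:
        end = min(start + chunk_length_s, total_s)
        bounds.append((start, end))
        if end >= total_s:  # the last window reaches the end of the audio
            break
        start += step
    return bounds


print(chunk_bounds(12.0))
# [(0.0, 5.0), (3.0, 8.0), (6.0, 11.0), (9.0, 12.0)]
```

With these values a 12-second recording is covered by four windows that start 3 seconds apart, so neighbouring windows overlap by 2 seconds.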
📄 License
This model is released under the Apache-2.0 license.
📚 Citation
If you want to cite this model, you can use the following BibTeX entry:
@misc{grosman2021xlsr53-large-portuguese,
title={Fine-tuned {XLSR}-53 large model for speech recognition in {P}ortuguese},
author={Grosman, Jonatas},
howpublished={\url{https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-portuguese}},
year={2021}
}