The open-source wav2vec2-large-xlsr-53-portuguese model - Free support for Portuguese speech-to-text conversion

Wav2vec2 Large Xlsr 53 Portuguese

Developed by jonatasgrosman

This is a fine-tuned XLSR-53 large model for Portuguese speech recognition tasks, trained on the Common Voice 6.1 dataset, supporting Portuguese speech-to-text conversion.

Speech Recognition | Other | Open Source License: Apache-2.0 | #Portuguese speech recognition #XLSR-53 fine-tuning #Low word error rate

Downloads 4.9M

Release Time: 3/2/2022

Model Overview

This is a Portuguese automatic speech recognition (ASR) model fine-tuned from the pretrained facebook/wav2vec2-large-xlsr-53 model. It converts Portuguese speech into text.

Model Features

High-precision Portuguese recognition

Achieves a word error rate (WER) of 11.31% and a character error rate (CER) of 3.74% on the Common Voice Portuguese test set.

Language model enhancement support

When combined with a language model, the word error rate can be further reduced to 9.01% and the character error rate to 3.21%.

16kHz sampling rate support

Optimized specifically for 16kHz sampled speech input.

GPU-accelerated training

Utilizes GPU computing resources provided by OVHcloud for efficient training.

Model Capabilities

Portuguese speech recognition

Real-time speech-to-text

Batch audio processing

Use Cases

Speech transcription

Meeting transcription

Automatically converts Portuguese meeting recordings into text transcripts

WER of 9.01% on the Common Voice Portuguese test set when combined with a language model (measured on read speech, not on meeting recordings)

Voice memo conversion

Converts personal voice memos into searchable text

WER of 11.31% on the Common Voice Portuguese test set without a language model

Assistive technology

Voice input system

Provides voice input solutions for Portuguese-speaking users

🚀 Fine-tuned XLSR-53 large model for speech recognition in Portuguese

This model is a fine-tuned version of facebook/wav2vec2-large-xlsr-53 on Portuguese, using the train and validation splits of Common Voice 6.1. It provides high-quality speech recognition for Portuguese.

📦 Information Table

Property	Details
Model Type	Fine-tuned XLSR-53 large model for Portuguese speech recognition
Training Data	Common Voice 6.1 (train and validation splits), mozilla-foundation/common_voice_6_0
Metrics	WER (Word Error Rate), CER (Character Error Rate)
Tags	audio, automatic-speech-recognition, hf-asr-leaderboard, mozilla-foundation/common_voice_6_0, pt, robust-speech-event, speech, xlsr-fine-tuning-week

🚀 Quick Start

This fine-tuned model is based on facebook/wav2vec2-large-xlsr-53 and trained on Portuguese using the train and validation splits of Common Voice 6.1. When using this model, ensure that your speech input is sampled at 16kHz.

This model was fine-tuned thanks to GPU credits generously provided by OVHcloud :)

The script used for training can be found here: https://github.com/jonatasgrosman/wav2vec2-sprint
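The 16kHz requirement above can be checked before transcription. The following is a minimal sketch using only Python's standard `wave` module, so it covers uncompressed WAV files only. The helper names `wav_sample_rate` and `needs_resampling` are hypothetical and are not part of the model's tooling.

```python
import wave

EXPECTED_RATE = 16_000  # sampling rate the model expects, per the note above


def wav_sample_rate(path):
    """Return the sampling rate (Hz) stored in a WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, expected_rate=EXPECTED_RATE):
    """Return True when the file's sampling rate differs from the expected one."""
    return wav_sample_rate(path) != expected_rate
```

For files that fail this check, one option is to resample on load, as the inference script in this document does by passing `sr=16_000` to `librosa.load`.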

💻 Usage Examples

Basic Usage

Using the HuggingSound library:

from huggingsound import SpeechRecognitionModel

model = SpeechRecognitionModel("jonatasgrosman/wav2vec2-large-xlsr-53-portuguese")
audio_paths = ["/path/to/file.mp3", "/path/to/another_file.wav"]

transcriptions = model.transcribe(audio_paths)

Advanced Usage

Writing your own inference script:

import torch
import librosa
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "pt"
MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-portuguese"
SAMPLES = 10

test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = batch["sentence"].upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentences = processor.batch_decode(predicted_ids)

for i, predicted_sentence in enumerate(predicted_sentences):
    print("-" * 100)
    print("Reference:", test_dataset[i]["sentence"])
    print("Prediction:", predicted_sentence)
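The `torch.argmax` and `processor.batch_decode` steps in the script above amount to greedy CTC decoding: pick the most likely token per frame, collapse consecutive repeats, then drop the blank token. The sketch below illustrates that idea in plain Python on a toy vocabulary. The vocabulary, the `<pad>` blank symbol, the `|` word delimiter, and the function name `greedy_ctc_decode` are assumptions made for illustration; the real tokenizer handles more cases.

```python
BLANK = "<pad>"        # assumed CTC blank symbol
WORD_DELIMITER = "|"   # assumed word-boundary symbol


def greedy_ctc_decode(frame_ids, vocab):
    """Collapse repeated frame predictions, drop blanks, and join into text."""
    tokens = []
    previous_id = None
    for token_id in frame_ids:
        # Keep a token only when it differs from the previous frame's token.
        if token_id != previous_id:
            tokens.append(vocab[token_id])
        previous_id = token_id
    # Blanks separate genuine repeated letters, so they are removed afterwards.
    text = "".join(t for t in tokens if t != BLANK)
    # Turn word delimiters into single spaces and trim the ends.
    return " ".join(text.replace(WORD_DELIMITER, " ").split())


# Toy example: per-frame argmax ids for a short utterance (hypothetical data).
TOY_VOCAB = ["<pad>", "|", "O", "I", "T"]
frame_ids = [2, 2, 0, 3, 4, 4, 0, 2, 1, 1]
print(greedy_ctc_decode(frame_ids, TOY_VOCAB))  # prints "OITO"
```

Because repeats are collapsed before blanks are removed, a doubled letter survives only when a blank frame sits between its two occurrences.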

Running the inference script above on the first 10 test samples produces the following references and predictions:

Reference	Prediction
NEM O RADAR NEM OS OUTROS INSTRUMENTOS DETECTARAM O BOMBARDEIRO STEALTH.	NEMHUM VADAN OS OLTWES INSTRUMENTOS DE TTÉÃN UM BOMBERDEIRO OSTER
PEDIR DINHEIRO EMPRESTADO ÀS PESSOAS DA ALDEIA	E DIR ENGINHEIRO EMPRESTAR AS PESSOAS DA ALDEIA
OITO	OITO
TRANCÁ-LOS	TRANCAUVOS
REALIZAR UMA INVESTIGAÇÃO PARA RESOLVER O PROBLEMA	REALIZAR UMA INVESTIGAÇÃO PARA RESOLVER O PROBLEMA
O YOUTUBE AINDA É A MELHOR PLATAFORMA DE VÍDEOS.	YOUTUBE AINDA É A MELHOR PLATAFOMA DE VÍDEOS
MENINA E MENINO BEIJANDO NAS SOMBRAS	MENINA E MENINO BEIJANDO NAS SOMBRAS
EU SOU O SENHOR	EU SOU O SENHOR
DUAS MULHERES QUE SENTAM-SE PARA BAIXO LENDO JORNAIS.	DUAS MIERES QUE SENTAM-SE PARA BAICLANE JODNÓI
EU ORIGINALMENTE ESPERAVA	EU ORIGINALMENTE ESPERAVA
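Reference and prediction pairs like those in the table can be scored with WER and CER, the two metrics this model card reports. Both are the edit distance between reference and prediction divided by the reference length, counted in words for WER and in characters for CER. The sketch below is a plain-Python illustration, not the evaluation code used for the reported numbers. It applies no text normalization, so the trailing period in the second reference was removed by hand.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    previous_row = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current_row = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_item == hyp_item else 1
            current_row.append(min(
                previous_row[j] + 1,                      # deletion
                current_row[j - 1] + 1,                   # insertion
                previous_row[j - 1] + substitution_cost,  # match or substitution
            ))
        previous_row = current_row
    return previous_row[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


pairs = [
    ("OITO", "OITO"),
    ("O YOUTUBE AINDA É A MELHOR PLATAFORMA DE VÍDEOS",
     "YOUTUBE AINDA É A MELHOR PLATAFOMA DE VÍDEOS"),
]
for reference, prediction in pairs:
    print(f"WER={wer(reference, prediction):.3f}  CER={cer(reference, prediction):.3f}")
```

Scores from a handful of samples are only indicative; the reported 11.31% WER and 3.74% CER refer to the full Common Voice Portuguese test set.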

🔧 Evaluation

To evaluate on mozilla-foundation/common_voice_6_0 with the test split:

python eval.py --model_id jonatasgrosman/wav2vec2-large-xlsr-53-portuguese --dataset mozilla-foundation/common_voice_6_0 --config pt --split test

To evaluate on speech-recognition-community-v2/dev_data:

python eval.py --model_id jonatasgrosman/wav2vec2-large-xlsr-53-portuguese --dataset speech-recognition-community-v2/dev_data --config pt --split validation --chunk_length_s 5.0 --stride_length_s 1.0
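The `--chunk_length_s 5.0` and `--stride_length_s 1.0` flags suggest that long recordings are split into overlapping windows before transcription. Since `eval.py` is not shown here, the sketch below is only an assumption about how such windowing might work: 5-second chunks that overlap their neighbours by 1 second on each side, so consecutive chunks start 3 seconds apart. The function name `chunk_bounds` is hypothetical.

```python
def chunk_bounds(num_samples, sampling_rate=16_000,
                 chunk_length_s=5.0, stride_length_s=1.0):
    """Return (start, end) sample indices of overlapping chunks.

    Assumes each chunk shares `stride_length_s` seconds with its neighbour on
    both sides, so consecutive chunks start `chunk - 2 * stride` samples apart.
    """
    chunk_len = int(round(chunk_length_s * sampling_rate))
    stride = int(round(stride_length_s * sampling_rate))
    step = chunk_len - 2 * stride
    if step <= 0:
        raise ValueError("chunk length must exceed twice the stride length")

    bounds = []
    start = 0
    while start < num_samples:
        end = min(start + chunk_len, num_samples)
        bounds.append((start, end))
        if end == num_samples:
            break  # the last chunk reached the end of the audio
        start += step
    return bounds


# A 12-second recording at 16 kHz (hypothetical input length).
for start, end in chunk_bounds(12 * 16_000):
    print(f"{start / 16_000:.1f}s -> {end / 16_000:.1f}s")
```

The overlap gives the model acoustic context at chunk edges; the overlapping regions are typically discarded or merged when the per-chunk outputs are stitched together.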

📄 License

This model is released under the Apache-2.0 license.

📚 Citation

If you want to cite this model, you can use the following entry:

@misc{grosman2021xlsr53-large-portuguese,
  title={Fine-tuned {XLSR}-53 large model for speech recognition in {P}ortuguese},
  author={Grosman, Jonatas},
  howpublished={\url{https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-portuguese}},
  year={2021}
}
