🚀 Fine-tuned XLSR-53 large model for speech recognition in Polish
This project presents a fine-tuned facebook/wav2vec2-large-xlsr-53 model for Polish speech recognition. It was fine-tuned on the train and validation splits of Common Voice 6.1. Make sure your speech input is sampled at 16 kHz when using this model. The fine-tuning was done with GPU credits from OVHcloud. The training script is available at: https://github.com/jonatasgrosman/wav2vec2-sprint.
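The 16 kHz requirement can be checked before transcription. The following is a minimal sketch that uses only Python's standard `wave` module, so it handles uncompressed WAV files only. The helper name `is_16khz_wav` is hypothetical and is not part of the model or of any library mentioned here.

```python
import wave

TARGET_RATE_HZ = 16_000  # sampling rate the model expects


def is_16khz_wav(path):
    """Return True if the WAV file at `path` is sampled at 16 kHz."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate() == TARGET_RATE_HZ
```

Files recorded at another rate need to be resampled first, for example with `librosa.load(path, sr=16_000)`, which is the call used in the custom inference script in the Advanced Usage section.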
📦 Installation
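No dedicated installation steps are documented. The examples below depend on the Python packages they import: huggingsound for basic usage, and torch, librosa, datasets, and transformers for the custom inference script.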
💻 Usage Examples
Basic Usage
Using the HuggingSound library:
from huggingsound import SpeechRecognitionModel
model = SpeechRecognitionModel("jonatasgrosman/wav2vec2-large-xlsr-53-polish")
audio_paths = ["/path/to/file.mp3", "/path/to/another_file.wav"]
transcriptions = model.transcribe(audio_paths)
Advanced Usage
Writing your own inference script:
import torch
import librosa
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "pl"
MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-polish"
SAMPLES = 5

test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Preprocessing: read each audio file as a 16 kHz array and uppercase the reference text
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = batch["sentence"].upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

# Inference without gradient tracking
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

# Greedy decoding: pick the most likely token per frame, then decode to text
predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentences = processor.batch_decode(predicted_ids)

for i, predicted_sentence in enumerate(predicted_sentences):
    print("-" * 100)
    print("Reference:", test_dataset[i]["sentence"])
    print("Prediction:", predicted_sentence)
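The `torch.argmax` and `processor.batch_decode` steps in the script amount to greedy CTC decoding: take the most likely token for each frame, merge consecutive repeats, then drop the blank token. The pure-Python sketch below illustrates that collapse on a toy example. The vocabulary, the blank id `0`, and the function names are illustrative assumptions, not the model's actual tokenizer.

```python
BLANK_ID = 0  # assumed id of the CTC blank token in this toy vocabulary

# Toy vocabulary for illustration only; the real one ships with the processor.
TOY_VOCAB = {1: "G", 2: "O", 3: " "}


def ctc_greedy_collapse(frame_ids, blank_id=BLANK_ID):
    """Merge consecutive repeated ids, then drop blanks."""
    collapsed = []
    previous = None
    for token_id in frame_ids:
        if token_id != previous and token_id != blank_id:
            collapsed.append(token_id)
        previous = token_id
    return collapsed


def decode(frame_ids):
    """Map the collapsed ids back to characters using the toy vocabulary."""
    return "".join(TOY_VOCAB[i] for i in ctc_greedy_collapse(frame_ids))


# The blank between the two runs of "O" keeps them as separate letters.
print(decode([1, 1, 0, 2, 2, 0, 0, 2]))  # GOO
```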
Sample reference transcripts and the corresponding model predictions:

| Reference | Prediction |
| --- | --- |
| """CZY DRZWI BYŁY ZAMKNIĘTE?""" | PRZY DRZWI BYŁY ZAMKNIĘTE |
| GDZIEŻ TU POWÓD DO WYRZUTÓW? | WGDZIEŻ TO POM DO WYRYDÓ |
| """O TEM JEDNAK NIE BYŁO MOWY.""" | O TEM JEDNAK NIE BYŁO MOWY |
| LUBIĘ GO. | LUBIĄ GO |
| — TO MI NIE POMAGA. | TO MNIE NIE POMAGA |
| WCIĄŻ LUDZIE WYSIADAJĄ PRZED ZAMKIEM, Z MIASTA, Z PRAGI. | WCIĄŻ LUDZIE WYSIADAJĄ PRZED ZAMKIEM Z MIASTA Z PRAGI |
| ALE ON WCALE INACZEJ NIE MYŚLAŁ. | ONY MONITCENIE PONACZUŁA NA MASU |
| A WY, CO TAK STOICIE? | A WY CO TAK STOICIE |
| A TEN PRZYRZĄD DO CZEGO SŁUŻY? | A TEN PRZYRZĄD DO CZEGO SŁUŻY |
| NA JUTRZEJSZYM KOLOKWIUM BĘDZIE PIĘĆ PYTAŃ OTWARTYCH I TEST WIELOKROTNEGO WYBORU. | NAJUTRZEJSZYM KOLOKWIUM BĘDZIE PIĘĆ PYTAŃ OTWARTYCH I TEST WIELOKROTNEGO WYBORU |
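The references in this comparison keep punctuation and quotation marks, while the predictions contain only letters and spaces, so references are usually normalized before they are scored against predictions. The sketch below shows one way to do that. The helper `normalize_transcript` and its character set are assumptions chosen to cover the punctuation seen here; they are not taken from the project's evaluation script.

```python
import re

# Assumed set of characters to strip: quotes, sentence punctuation, dashes.
PUNCTUATION_PATTERN = re.compile("[\"'.,?!:;\u2014\u2013-]")


def normalize_transcript(text):
    """Uppercase the text, remove punctuation, and collapse whitespace."""
    text = PUNCTUATION_PATTERN.sub(" ", text.upper())
    return " ".join(text.split())


print(normalize_transcript("A WY, CO TAK STOICIE?"))  # A WY CO TAK STOICIE
```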
📚 Documentation
Evaluation
- To evaluate on mozilla-foundation/common_voice_6_0 with the test split:

  python eval.py --model_id jonatasgrosman/wav2vec2-large-xlsr-53-polish --dataset mozilla-foundation/common_voice_6_0 --config pl --split test

- To evaluate on speech-recognition-community-v2/dev_data:

  python eval.py --model_id jonatasgrosman/wav2vec2-large-xlsr-53-polish --dataset speech-recognition-community-v2/dev_data --config pl --split validation --chunk_length_s 5.0 --stride_length_s 1.0
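The second command splits long recordings into overlapping chunks. The sketch below illustrates how a 5.0 s chunk length with a 1.0 s stride could tile a recording, assuming `--chunk_length_s` and `--stride_length_s` behave like the same-named arguments of the Transformers ASR pipeline, where the stride applies to both sides of a chunk. The function `chunk_windows` is hypothetical and simplified; it is not the code that `eval.py` or the pipeline actually runs.

```python
def chunk_windows(total_s, chunk_s=5.0, stride_s=1.0):
    """Return (start, end) times in seconds of overlapping chunks.

    Consecutive chunks advance by chunk_s - 2 * stride_s, so neighbours
    overlap by 2 * stride_s seconds.
    """
    step = chunk_s - 2 * stride_s
    if step <= 0:
        raise ValueError("chunk_s must be larger than twice stride_s")
    windows = []
    start = 0.0
    while start < total_s:
        end = min(start + chunk_s, total_s)
        windows.append((start, end))
        if end >= total_s:
            break  # the last chunk already reaches the end of the audio
        start += step
    return windows


# A 12-second recording is covered by four overlapping windows.
print(chunk_windows(12.0))  # [(0.0, 5.0), (3.0, 8.0), (6.0, 11.0), (9.0, 12.0)]
```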
Model Information
| Property | Details |
| --- | --- |
| Model Type | Fine-tuned XLSR-53 large model for Polish speech recognition |
| Training Data | Train and validation splits of Common Voice 6.1 |
| Metrics | WER, CER |
| Tags | audio, automatic-speech-recognition, hf-asr-leaderboard, mozilla-foundation/common_voice_6_0, pl, robust-speech-event, speech, xlsr-fine-tuning-week |
Results
- Automatic Speech Recognition on Common Voice pl:
  - Test WER: 14.21
  - Test CER: 3.49
  - Test WER (+LM): 10.98
  - Test CER (+LM): 2.93
- Automatic Speech Recognition on Robust Speech Event - Dev Data:
  - Dev WER: 33.18
  - Dev CER: 15.92
  - Dev WER (+LM): 29.31
  - Dev CER (+LM): 15.17
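WER and CER are both edit-distance ratios: the number of word-level or character-level edits needed to turn the prediction into the reference, divided by the reference length. The sketch below is a minimal pure-Python illustration with hypothetical helper names; it assumes a non-empty reference and counts spaces as characters for CER, which may differ from the metric implementation used to produce the figures above. It returns fractions, whereas the reported figures appear to be percentages.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def wer(reference, hypothesis):
    """Word error rate as a fraction of the reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate as a fraction of the reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


# One wrong word out of two, one wrong character out of eight.
print(wer("LUBIĘ GO", "LUBIĄ GO"))  # 0.5
print(cer("LUBIĘ GO", "LUBIĄ GO"))  # 0.125
```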
📄 License
This project is licensed under the Apache-2.0 license.
📖 Citation
To cite this model, use the following BibTeX entry:
@misc{grosman2021xlsr53-large-polish,
title={Fine-tuned {XLSR}-53 large model for speech recognition in {P}olish},
author={Grosman, Jonatas},
howpublished={\url{https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-polish}},
year={2021}
}