wav2vec2-large-xlsr-53-polish: Open-Source Model for Automatic Polish Speech Recognition

Wav2vec2 Large Xlsr 53 Polish

Developed by jonatasgrosman

A large XLSR-53 speech recognition model for Polish, fine-tuned from facebook/wav2vec2-large-xlsr-53

Speech Recognition · Other · Open Source License: Apache-2.0 · #Polish speech recognition · #Low character error rate · #XLSR fine-tuning

Downloads 412.13k

Release Time: 3/2/2022

Model Overview

This is a Polish speech recognition model based on the XLSR-53 architecture, fine-tuned using the Common Voice 6.1 Polish dataset, suitable for Polish speech-to-text tasks.

Model Features

Polish Optimization

Specially fine-tuned for Polish, achieving a word error rate of 14.21% on the Common Voice Polish test set

Language Model Integration Support

Can be combined with a language model to further improve recognition accuracy, reducing the word error rate to 10.98%

Robust Speech Processing

Also evaluated on the Robust Speech Event dev data, reaching a word error rate of 33.18% (29.31% with a language model), which is higher than on Common Voice

Model Capabilities

Polish speech recognition

Audio-to-text conversion

Expects audio input sampled at 16 kHz

Use Cases

Speech Transcription

Polish Speech Transcription

Convert Polish speech content into text

Word error rate of 14.21% and character error rate of 3.49% on the Common Voice test set

Voice Assistant

Polish Voice Command Recognition

Transcribe Polish voice commands to text, which a downstream component can then interpret

🚀 Fine-tuned XLSR-53 large model for speech recognition in Polish

This project presents a fine-tuned facebook/wav2vec2-large-xlsr-53 model for Polish speech recognition. It was trained and validated on Common Voice 6.1. When using this model, make sure your speech input is sampled at 16 kHz. The model was fine-tuned using GPU credits from OVHcloud. The training script can be found at: https://github.com/jonatasgrosman/wav2vec2-sprint.
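The 16 kHz requirement can be checked before transcription. The following is a minimal sketch using only Python's standard `wave` module, so it covers uncompressed WAV files only; the helper names `wav_sample_rate` and `needs_resampling` are hypothetical and not part of the model or its libraries.

```python
import wave

TARGET_RATE = 16_000  # sample rate the model expects, in Hz


def wav_sample_rate(path):
    """Return the sample rate (Hz) stored in a WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_rate=TARGET_RATE):
    """True when the file's sample rate differs from the target rate."""
    return wav_sample_rate(path) != target_rate
```

Files that fail the check need resampling first. The inference script in this card does that at load time by calling `librosa.load` with `sr=16_000`.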

📦 Installation

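The original model card gives no installation steps. The usage examples below import huggingsound, torch, librosa, datasets, and transformers, so the packages they use must be installed first.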

💻 Usage Examples

Basic Usage

Using the HuggingSound library:

from huggingsound import SpeechRecognitionModel

model = SpeechRecognitionModel("jonatasgrosman/wav2vec2-large-xlsr-53-polish")
audio_paths = ["/path/to/file.mp3", "/path/to/another_file.wav"]

transcriptions = model.transcribe(audio_paths)
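To read the result, a small helper can pair each input path with its text. This is a sketch under one assumption: that `transcribe` returns one dict per audio file with a `"transcription"` key, as the HuggingSound README describes. Check this against the installed version. The helper name `format_transcriptions` is hypothetical.

```python
def format_transcriptions(audio_paths, transcriptions):
    """Pair each audio path with its transcribed text.

    Assumes each item in `transcriptions` is a dict with a
    "transcription" key (an assumption about HuggingSound's output).
    """
    lines = []
    for path, result in zip(audio_paths, transcriptions):
        lines.append(f"{path}: {result['transcription']}")
    return lines
```

With the variables from the example above, `print("\n".join(format_transcriptions(audio_paths, transcriptions)))` would list one transcription per file.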

Advanced Usage

Writing your own inference script:

import torch
import librosa
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "pl"
MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-polish"
SAMPLES = 10  # number of test clips to transcribe (the sample output lists ten)

test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = batch["sentence"].upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

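# Run inference without tracking gradients
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

# Greedy CTC decoding: pick the most likely token per frame,
# then let the processor merge repeats and drop padding tokens
predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentences = processor.batch_decode(predicted_ids)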

for i, predicted_sentence in enumerate(predicted_sentences):
    print("-" * 100)
    print("Reference:", test_dataset[i]["sentence"])
    print("Prediction:", predicted_sentence)
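The last two steps of the script, `torch.argmax` followed by `batch_decode`, amount to greedy CTC decoding. The sketch below shows the collapse rule in plain Python, assuming a toy four-token vocabulary invented for illustration; the real tokenizer has its own vocabulary and handles more cases. The function name `ctc_greedy_decode` is hypothetical.

```python
def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0, word_delimiter="|"):
    """Collapse per-frame token ids into text using the CTC rule:
    merge consecutive repeats, then drop blank tokens."""
    pieces = []
    previous = None
    for token_id in frame_ids:
        # Emit a token only when it starts a new run and is not the blank
        if token_id != previous and token_id != blank_id:
            pieces.append(id_to_token[token_id])
        previous = token_id
    # The word delimiter token stands for a space between words
    return "".join(pieces).replace(word_delimiter, " ").strip()


# Hypothetical toy vocabulary, not the model's real one
TOY_VOCAB = {0: "<pad>", 1: "|", 2: "G", 3: "O"}
decoded = ctc_greedy_decode([2, 2, 0, 3, 3, 3, 0, 0], TOY_VOCAB)
```

A blank between two identical ids keeps both letters, which is how CTC represents doubled characters.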

Sample output, comparing each reference sentence with the model's prediction:

Reference	Prediction
"CZY DRZWI BYŁY ZAMKNIĘTE?"	PRZY DRZWI BYŁY ZAMKNIĘTE
GDZIEŻ TU POWÓD DO WYRZUTÓW?	WGDZIEŻ TO POM DO WYRYDÓ
"O TEM JEDNAK NIE BYŁO MOWY."	O TEM JEDNAK NIE BYŁO MOWY
LUBIĘ GO.	LUBIĄ GO
— TO MI NIE POMAGA.	TO MNIE NIE POMAGA
WCIĄŻ LUDZIE WYSIADAJĄ PRZED ZAMKIEM, Z MIASTA, Z PRAGI.	WCIĄŻ LUDZIE WYSIADAJĄ PRZED ZAMKIEM Z MIASTA Z PRAGI
ALE ON WCALE INACZEJ NIE MYŚLAŁ.	ONY MONITCENIE PONACZUŁA NA MASU
A WY, CO TAK STOICIE?	A WY CO TAK STOICIE
A TEN PRZYRZĄD DO CZEGO SŁUŻY?	A TEN PRZYRZĄD DO CZEGO SŁUŻY
NA JUTRZEJSZYM KOLOKWIUM BĘDZIE PIĘĆ PYTAŃ OTWARTYCH I TEST WIELOKROTNEGO WYBORU.	NAJUTRZEJSZYM KOLOKWIUM BĘDZIE PIĘĆ PYTAŃ OTWARTYCH I TEST WIELOKROTNEGO WYBORU
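Rows like these can be scored by hand. The sketch below computes per-sentence word error rate (WER) and character error rate (CER) from a Levenshtein edit distance. The `normalize` step, which strips punctuation before scoring, is an assumption made for illustration and is not necessarily the normalization used by the author's evaluation script. Reported corpus-level scores also aggregate errors over the whole test set, so per-sentence values will differ from them.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists or strings)."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalize(text):
    """Illustrative normalization: drop punctuation, squeeze spaces."""
    kept = "".join(ch for ch in text if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def wer(reference, prediction):
    ref_words = normalize(reference).split()
    return edit_distance(ref_words, normalize(prediction).split()) / len(ref_words)


def cer(reference, prediction):
    ref_chars = normalize(reference)
    return edit_distance(ref_chars, normalize(prediction)) / len(ref_chars)


print(wer("LUBIĘ GO.", "LUBIĄ GO"), cer("LUBIĘ GO.", "LUBIĄ GO"))  # 0.5 0.125
```

For the row "LUBIĘ GO." one of two words is wrong, while only one of eight characters differs, which is why CER is usually much lower than WER.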

📚 Documentation

Evaluation

To evaluate on the test split of mozilla-foundation/common_voice_6_0:

python eval.py --model_id jonatasgrosman/wav2vec2-large-xlsr-53-polish --dataset mozilla-foundation/common_voice_6_0 --config pl --split test

To evaluate on the validation split of speech-recognition-community-v2/dev_data, using chunked inference:

python eval.py --model_id jonatasgrosman/wav2vec2-large-xlsr-53-polish --dataset speech-recognition-community-v2/dev_data --config pl --split validation --chunk_length_s 5.0 --stride_length_s 1.0
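The `--chunk_length_s 5.0` and `--stride_length_s 1.0` flags suggest that long recordings are split into overlapping windows. The sketch below illustrates that general idea, assuming 5-second chunks with 1 second of overlap on each side, where only the non-overlapping middle of each chunk is kept. It is not taken from `eval.py`, whose internals this page does not show, and the function name `chunk_windows` is hypothetical.

```python
SAMPLE_RATE = 16_000  # Hz, as required by the model


def chunk_windows(n_samples, chunk_len, stride):
    """Return (start, end, keep_start, keep_end) sample indices per chunk.

    Each chunk overlaps its neighbours by `stride` samples on each side;
    the kept regions tile the recording without gaps or overlap.
    """
    step = chunk_len - 2 * stride
    if step <= 0:
        raise ValueError("chunk_len must exceed twice the stride")
    windows = []
    start = 0
    while True:
        end = min(start + chunk_len, n_samples)
        is_first = start == 0
        is_last = start + chunk_len >= n_samples
        keep_start = start if is_first else start + stride
        keep_end = end if is_last else end - stride
        windows.append((start, end, keep_start, keep_end))
        if is_last:
            break
        start += step
    return windows


# A 12-second clip with 5 s chunks and 1 s stride
windows = chunk_windows(12 * SAMPLE_RATE, 5 * SAMPLE_RATE, 1 * SAMPLE_RATE)
```

The overlap gives the model acoustic context at chunk borders, where predictions tend to be least reliable, at the cost of processing some audio twice.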

Model Information

Property	Details
Model Type	Fine-tuned XLSR-53 large model for Polish speech recognition
Training Data	Train and validation splits of Common Voice 6.1
Metrics	WER, CER
Tags	audio, automatic-speech-recognition, hf-asr-leaderboard, mozilla-foundation/common_voice_6_0, pl, robust-speech-event, speech, xlsr-fine-tuning-week

Results

Automatic Speech Recognition on Common Voice pl:
- Test WER: 14.21
- Test CER: 3.49
- Test WER (+LM): 10.98
- Test CER (+LM): 2.93
Automatic Speech Recognition on Robust Speech Event - Dev Data:
- Dev WER: 33.18
- Dev CER: 15.92
- Dev WER (+LM): 29.31
- Dev CER (+LM): 15.17
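The effect of the language model is easier to compare across datasets as a relative reduction. The sketch below applies the usual formula, (baseline - improved) / baseline, to the figures listed above; the dictionary keys are labels chosen here for readability.

```python
# (without LM, with LM) error rates in percent, copied from the results above
RESULTS = {
    "Common Voice pl, WER": (14.21, 10.98),
    "Common Voice pl, CER": (3.49, 2.93),
    "Robust Speech Event dev, WER": (33.18, 29.31),
    "Robust Speech Event dev, CER": (15.92, 15.17),
}


def relative_reduction(baseline, improved):
    """Relative error reduction in percent."""
    return (baseline - improved) / baseline * 100


for name, (baseline, with_lm) in RESULTS.items():
    print(f"{name}: {relative_reduction(baseline, with_lm):.1f}% relative reduction")
```

On Common Voice the language model removes roughly 22.7% of word errors, a larger relative gain than on the Robust Speech Event dev data.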

📄 License

This project is licensed under the Apache-2.0 license.

📖 Citation

If you want to cite this model you can use this:

@misc{grosman2021xlsr53-large-polish,
  title={Fine-tuned {XLSR}-53 large model for speech recognition in {P}olish},
  author={Grosman, Jonatas},
  howpublished={\url{https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-polish}},
  year={2021}
}
