wav2vec2-xls-r-1b-polish: Open-Source Polish Speech Recognition Model

Developed by jonatasgrosman

This is a Polish automatic speech recognition (ASR) model fine-tuned from the XLS-R 1-billion-parameter model on datasets including Common Voice 8.0. It expects audio input sampled at 16 kHz.

Speech Recognition

Transformers

Open Source License: Apache-2.0 #Polish speech recognition #Low CER high accuracy #Multi-dataset training

Downloads 212

Release date: 3/2/2022

Model Overview

This model is an automatic speech recognition system for Polish, fine-tuned from Facebook's XLS-R 1-billion-parameter model (facebook/wav2vec2-xls-r-1b).

Model Features

High-performance Polish recognition

Achieves 11.01% WER and 2.55% CER on the Common Voice 8.0 test set

Supports language model enhancement

With a language model, WER can be reduced to 7.32% and CER to 1.95%

Large-scale pre-training foundation

Fine-tuned from the XLS-R 1-billion-parameter model, inheriting its pre-trained speech representations

Multi-dataset training

Trained on the Common Voice 8.0, Multilingual LibriSpeech, and VoxPopuli datasets

Model Capabilities

Polish speech recognition

16kHz audio processing

Batch speech transcription

Use Cases

Speech transcription

Speech-to-text services

Convert Polish speech content into text

Reaches 7.32% WER on the Common Voice 8.0 test set with a language model, i.e. roughly 92.68% word-level accuracy

Voice assistants

Polish voice command recognition

Used for voice-controlled devices and applications

🚀 XLS-R Wav2Vec2 Polish by Jonatas Grosman

This is a fine-tuned XLS-R 1B model for Polish speech recognition. It transcribes Polish speech into text.

🚀 Quick Start

This model is a fine-tuned version of facebook/wav2vec2-xls-r-1b for Polish. It was trained on the train and validation splits of Common Voice 8.0, Multilingual LibriSpeech, and VoxPopuli. When using this model, make sure your speech input is sampled at 16 kHz.
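The 16 kHz requirement above can be checked before transcription. The following is a minimal sketch, assuming the input is an uncompressed PCM WAV file; the helper names `wav_sample_rate` and `needs_resampling` are hypothetical and are not part of HuggingSound or Transformers.

```python
import wave

EXPECTED_RATE = 16_000  # sampling rate the model card asks for, in Hz


def wav_sample_rate(path):
    """Return the sample rate (Hz) recorded in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, expected_rate=EXPECTED_RATE):
    """Return True when the file's sample rate differs from the expected one."""
    return wav_sample_rate(path) != expected_rate
```

A file flagged by `needs_resampling` should be resampled before it reaches the model. The inference script in this card does that on load by calling `librosa.load(..., sr=16_000)`.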

This model was fine-tuned with the HuggingSound tool, thanks to GPU credits generously provided by OVHcloud.

✨ Features

Automatic Speech Recognition: Transcribes Polish speech into text.
Fine-tuned on Multiple Datasets: Uses data from Common Voice 8.0, Multilingual LibriSpeech, and VoxPopuli.

💻 Usage Examples

Basic Usage

Using the HuggingSound library:

from huggingsound import SpeechRecognitionModel

model = SpeechRecognitionModel("jonatasgrosman/wav2vec2-xls-r-1b-polish")
audio_paths = ["/path/to/file.mp3", "/path/to/another_file.wav"]

transcriptions = model.transcribe(audio_paths)

Advanced Usage

Writing your own inference script:

import torch
import librosa
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "pl"
MODEL_ID = "jonatasgrosman/wav2vec2-xls-r-1b-polish"
SAMPLES = 10

test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = batch["sentence"].upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentences = processor.batch_decode(predicted_ids)
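
The last two lines of the script above pick the most likely token per audio frame and then decode the resulting ID sequence. For a CTC model, that decoding conceptually collapses consecutive repeats and drops the blank token. The sketch below illustrates the idea on plain lists. The toy vocabulary, the `blank_id` of 0 and the `|` word delimiter are assumptions for illustration, not values read from this model's tokenizer.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0, word_delimiter="|"):
    """Collapse repeated frame predictions, drop blanks, and join tokens."""
    # 1) Merge runs of identical IDs: [2, 2, 0, 3] -> [2, 0, 3]
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # 2) Remove the blank token, which only separates repeated characters
    tokens = [id_to_token[i] for i in collapsed if i != blank_id]
    # 3) Turn the word delimiter into spaces
    return "".join(tokens).replace(word_delimiter, " ").strip()


# Hypothetical toy vocabulary (the real one ships with the model)
TOY_VOCAB = {0: "<pad>", 1: "|", 2: "t", 3: "a", 4: "k"}

frames = [2, 2, 0, 3, 3, 3, 0, 4, 1, 1, 2, 0, 3, 3, 0, 4]
text = ctc_greedy_decode(frames, TOY_VOCAB)
```

Because blanks are removed only after repeats are merged, a genuinely doubled letter survives as long as a blank frame separates the two occurrences.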

🔧 Technical Details

Property	Details
Model Type	Fine-tuned XLS-R 1B for Polish speech recognition
Training Data	Train and validation splits of Common Voice 8.0, Multilingual LibriSpeech, and VoxPopuli

Evaluation Results

Task	Dataset	Metric	Value
Automatic Speech Recognition	Common Voice 8	Test WER	11.01
Automatic Speech Recognition	Common Voice 8	Test CER	2.55
Automatic Speech Recognition	Common Voice 8	Test WER (+LM)	7.32
Automatic Speech Recognition	Common Voice 8	Test CER (+LM)	1.95
Automatic Speech Recognition	Robust Speech Event - Dev Data	Dev WER	26.31
Automatic Speech Recognition	Robust Speech Event - Dev Data	Dev CER	13.85
Automatic Speech Recognition	Robust Speech Event - Dev Data	Dev WER (+LM)	20.33
Automatic Speech Recognition	Robust Speech Event - Dev Data	Dev CER (+LM)	13.0
Automatic Speech Recognition	Robust Speech Event - Test Data	Test WER	22.77
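
The WER and CER values in the table above are percentages of edit operations relative to the reference length, counted over words and characters respectively. The sketch below shows that standard definition in plain Python. It is not the code used to produce the reported numbers, and details such as text normalization or whether spaces count as characters are assumptions that vary between evaluation scripts.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute/insert/delete = 1)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate as a fraction of the reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate as a fraction of the reference length (spaces included)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Multiplying the returned fraction by 100 gives values on the same scale as the table, so a result of 0.1101 corresponds to a WER of 11.01.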

📚 Documentation

Evaluation Commands

To evaluate on the test split of mozilla-foundation/common_voice_8_0:

python eval.py --model_id jonatasgrosman/wav2vec2-xls-r-1b-polish --dataset mozilla-foundation/common_voice_8_0 --config pl --split test

To evaluate on the validation split of speech-recognition-community-v2/dev_data:

python eval.py --model_id jonatasgrosman/wav2vec2-xls-r-1b-polish --dataset speech-recognition-community-v2/dev_data --config pl --split validation --chunk_length_s 5.0 --stride_length_s 1.0
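
The `--chunk_length_s 5.0 --stride_length_s 1.0` flags in the second command suggest that long recordings are transcribed in overlapping windows. The `eval.py` script itself is not shown here, so the following is only a sketch of one common scheme, assuming the stride is applied on both sides of every chunk except at the start and end of the audio. The function name `chunk_windows` is hypothetical.

```python
def chunk_windows(n_samples, chunk_len, stride):
    """Return (start, end, keep_start, keep_end) sample indices per chunk.

    Each chunk spans [start, end); only [keep_start, keep_end) is kept,
    so the overlapping strides serve purely as context for the model.
    """
    step = chunk_len - 2 * stride
    if step <= 0:
        raise ValueError("chunk_len must be larger than twice the stride")
    windows = []
    for start in range(0, n_samples, step):
        end = min(start + chunk_len, n_samples)
        is_last = start + chunk_len >= n_samples
        left = 0 if start == 0 else stride   # nothing to discard at the beginning
        right = 0 if is_last else stride     # nothing to discard at the very end
        windows.append((start, end, start + left, end - right))
        if is_last:
            break
    return windows


SAMPLE_RATE = 16_000
# 11 seconds of audio, 5 s chunks, 1 s stride, all expressed in samples
example = chunk_windows(11 * SAMPLE_RATE, 5 * SAMPLE_RATE, 1 * SAMPLE_RATE)
```

With these values the window advances by 3 seconds per chunk, and the kept regions tile the recording without gaps or overlap.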

📄 License

This model is licensed under the Apache-2.0 license.

📖 Citation

To cite this model, use the following BibTeX entry:

@misc{grosman2021xlsr-1b-polish,
  title={Fine-tuned {XLS-R} 1{B} model for speech recognition in {P}olish},
  author={Grosman, Jonatas},
  howpublished={\url{https://huggingface.co/jonatasgrosman/wav2vec2-xls-r-1b-polish}},
  year={2022}
}
