wav2vec2-large-xlsr-polish Open-source Speech Recognition Model - Accurately Identify Polish Speech Content

Wav2vec2 Large Xlsr Polish

Developed by mbien

A speech recognition model fine-tuned from facebook/wav2vec2-large-xlsr-53 on the Common Voice Polish dataset, achieving a word error rate (WER) of 23.01% on the test set.

Speech Recognition | Other | Open Source License: Apache-2.0 | #Polish speech recognition #XLSR fine-tuning #Low word error rate

Downloads 40

Release Time : 3/2/2022

Model Overview

This is an automatic speech recognition (ASR) model optimized for Polish, capable of converting Polish speech into text.

Model Features

High-accuracy Polish recognition

Achieves a word error rate of 23.01% on the Common Voice Polish test set

No language model required

Can be used directly without additional language model support
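
Running without a language model means the transcript comes straight from the acoustic model's per-frame predictions: take the most likely token at each frame, merge consecutive repeats, and drop the CTC blank token. The sketch below illustrates that collapsing step only. The toy vocabulary, the blank id of 0, and the function name `ctc_greedy_collapse` are assumptions made for illustration; in practice `processor.batch_decode` performs this step with the model's real vocabulary.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_ids, id_to_char, blank_id=0):
    """Turn per-frame argmax ids into text: merge repeats, then drop blanks."""
    # groupby yields one key per run of identical consecutive ids
    merged = [token_id for token_id, _ in groupby(frame_ids)]
    return "".join(id_to_char[i] for i in merged if i != blank_id)


# Hypothetical toy vocabulary (id 0 plays the role of the CTC blank)
vocab = {0: "<pad>", 1: "t", 2: "a", 3: "k"}
frames = [1, 1, 0, 2, 2, 2, 0, 0, 3, 3]
print(ctc_greedy_collapse(frames, vocab))  # tak
```

Merging repeats before removing blanks matters: a blank between two identical ids is what allows a doubled letter to survive the collapse.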

Based on XLSR architecture

Uses facebook's wav2vec2-large-xlsr-53 as the base model, with powerful speech feature extraction capabilities

Model Capabilities

Polish speech recognition

Audio to text conversion

16kHz audio processing

Use Cases

Speech transcription

Polish speech transcription

Convert Polish speech content into editable text format

Word error rate 23.01%

Voice assistants

Polish voice command recognition

Used for building Polish voice assistants or voice control systems

🚀 Wav2Vec2-Large-XLSR-53-Polish

This model is fine-tuned from facebook/wav2vec2-large-xlsr-53 on Polish using the Common Voice dataset. It's designed for automatic speech recognition tasks.

📋 Model Information

Property	Details
Model Type	Fine-tuned Wav2Vec2-Large-XLSR-53 for Polish
Training Data	Common Voice (train and validation splits)
Metrics	Word Error Rate (WER)
Base Model	facebook/wav2vec2-large-xlsr-53

📊 Model Performance

Task	Dataset	Metric	Value
Automatic Speech Recognition	Common Voice pl	Test WER	23.01%
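
To make the metric in the table concrete: WER is the word-level edit distance between the reference and the predicted transcript (substitutions, deletions, and insertions), divided by the number of reference words. A WER of 23.01% therefore corresponds to roughly 23 word errors per 100 reference words. The following is a minimal single-sentence sketch; the function name `word_error_rate` is an assumption for illustration and this is not the implementation behind the reported score.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of three reference words
print(word_error_rate("ala ma kota", "ala ma psa"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference. Corpus-level tools usually sum errors and reference words over all utterances before dividing, rather than averaging per-sentence scores.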

🚀 Quick Start

When using this model, make sure that your speech input is sampled at 16kHz.
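
A quick way to confirm this requirement for an uncompressed WAV file is to read the rate from its header before running inference. The sketch below uses only Python's standard `wave` module; the helper name `needs_resampling` is an assumption for illustration, and it does not cover compressed formats such as MP3.

```python
import wave

TARGET_RATE = 16_000  # sampling rate the model expects, in Hz


def needs_resampling(wav_path, target_rate=TARGET_RATE):
    """Return True when a PCM WAV file is not sampled at target_rate."""
    with wave.open(wav_path, "rb") as wav_file:
        return wav_file.getframerate() != target_rate
```

For example, `needs_resampling("clip.wav")` (with a hypothetical file name) returns `True` for a 48 kHz recording, which should then be resampled to 16 kHz before being passed to the processor.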

✨ Features

Fine-tuned on Polish language using the Common Voice dataset.
Suitable for automatic speech recognition tasks in Polish.

💻 Usage Examples

Basic Usage

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "pl", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("mbien/wav2vec2-large-xlsr-polish")
model = Wav2Vec2ForCTC.from_pretrained("mbien/wav2vec2-large-xlsr-polish")

resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
	speech_array, sampling_rate = torchaudio.load(batch["path"])
	batch["speech"] = resampler(speech_array).squeeze().numpy()
	return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
	logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])

Advanced Usage

import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import re

test_dataset = load_dataset("common_voice", "pl", split="test")
wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("mbien/wav2vec2-large-xlsr-polish")
model = Wav2Vec2ForCTC.from_pretrained("mbien/wav2vec2-large-xlsr-polish")
model.to("cuda")

chars_to_ignore_regex = '[\—\…\,\?\.\!\-\;\:\"\“\„\%\‘\”\�\«\»\'\’]'
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
	batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
	speech_array, sampling_rate = torchaudio.load(batch["path"])
	batch["speech"] = resampler(speech_array).squeeze().numpy()
	return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Running inference.
# We run the model on each batch and decode the predicted token ids into text
def evaluate(batch):
	inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

	with torch.no_grad():
		logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits

	pred_ids = torch.argmax(logits, dim=-1)
	batch["pred_strings"] = processor.batch_decode(pred_ids)
	return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))

Test Result: 23.01%
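
The evaluation script above strips punctuation from the reference sentences and lowercases them before scoring, presumably so that the references match the kind of text the model emits and punctuation or casing differences are not counted as word errors. The sketch below shows the effect on one sentence. It uses a reduced punctuation set and a hypothetical helper name, `normalize_reference`, purely for illustration; the script's own `chars_to_ignore_regex` covers more characters.

```python
import re

# Reduced punctuation set for illustration only; the evaluation script
# above removes a longer list of characters.
PUNCTUATION = re.compile(r'[,?.!;:"„”-]')


def normalize_reference(sentence):
    """Remove punctuation and lowercase a reference transcript."""
    return PUNCTUATION.sub("", sentence).lower()


print(normalize_reference("Dzień dobry, jak się masz?"))
# dzień dobry jak się masz
```

Polish diacritics such as ń and ę are kept intact, since only the listed punctuation characters are removed.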

📚 Documentation

The Common Voice train and validation splits were used for training.

The script used for training can be found here

📄 License

This model is licensed under the Apache 2.0 license.
