# wav2vec2-large-xlsr-slovene: Open-source Slovene Speech Recognition Model

Wav2vec2 Large Xlsr Slovene

Developed by mrshu

This is a Slovenian speech recognition model fine-tuned from Facebook's wav2vec2-large-xlsr-53 model, trained using the Common Voice dataset.

Speech Recognition | Other | Open Source License: Apache-2.0 | #Slovenian speech recognition #Multilingual speech processing #Low-resource language optimization


Release Time: 3/2/2022

Model Overview

This model is specifically designed for automatic speech recognition tasks in Slovenian, capable of converting Slovenian speech input into text.

Model Features

Slovenian-specific recognition

Fine-tuned specifically for Slovenian; the reported word error rate on the Common Voice Slovene test set is 36.97%

Based on Common Voice dataset

Trained on the publicly available Common Voice Slovene train and validation data

16 kHz sampling rate

Expects speech input sampled at 16 kHz; audio recorded at other rates must be resampled first

Model Capabilities

Slovenian speech recognition

Speech-to-text

Use Cases

Speech transcription

Voice note transcription

Convert Slovenian voice notes into text

Meeting minutes

Automatically record Slovenian meeting content

Assistive technology

Voice control

Provide voice control interface for Slovenian users

🚀 Wav2Vec2-Large-XLSR-53-Slovene

This is a fine-tuned model based on facebook/wav2vec2-large-xlsr-53 for Slovene speech recognition, using the Common Voice dataset. Ensure your speech input is sampled at 16kHz when using this model.

📋 Metadata

Property	Details
Language	Slovene
Datasets	common_voice
Tags	audio, automatic-speech-recognition, speech, xlsr-fine-tuning-week
License	apache-2.0
Model Name	XLSR Wav2Vec2 Slovene
Task	Speech Recognition (automatic-speech-recognition)
Dataset	Common Voice sl (common_voice, args: sl)
Test WER	36.97

🚀 Quick Start

When using this model, make sure that your speech input is sampled at 16kHz.
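As a minimal sketch of how the 16 kHz requirement might be checked before transcription, the snippet below reads the sampling rate from a WAV header using only the Python standard library. The helper names `wav_sample_rate` and `needs_resampling` are hypothetical and not part of the model or of any library; the sketch also assumes uncompressed PCM WAV input, which the `wave` module can read. For other formats, the sampling rate returned by `torchaudio.load` in the usage example serves the same purpose.

```python
import wave

TARGET_RATE = 16_000  # sampling rate the model expects, per the model card


def wav_sample_rate(path):
    """Return the sampling rate (in Hz) stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_rate=TARGET_RATE):
    """Return True when the file's sampling rate differs from the target rate."""
    return wav_sample_rate(path) != target_rate
```

A file for which `needs_resampling` returns True should be resampled to 16 kHz before being passed to the processor.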

✨ Features

Fine-tuned for the Slovene language using the Common Voice dataset.
Can be used directly without a language model.
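Using the model without a language model means the text is read directly off the per-frame predictions. The sketch below illustrates the usual greedy CTC decoding rule (take the most likely id per frame, collapse consecutive repeats, drop the blank symbol) on toy data. The function name `ctc_greedy_decode`, the toy vocabulary, and the blank id of 0 are assumptions for illustration only; in practice `processor.batch_decode` performs this step with the model's real vocabulary.

```python
def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse consecutive repeated ids, drop blanks, and join the characters."""
    chars = []
    previous = None
    for token_id in frame_ids:
        # Emit a character only when the id changes and is not the blank.
        if token_id != previous and token_id != blank_id:
            chars.append(id_to_char[token_id])
        previous = token_id
    return "".join(chars)


# Hypothetical toy vocabulary; the real one ships with the model's processor.
toy_vocab = {0: "<pad>", 1: "d", 2: "a", 3: "n"}

# One predicted id per audio frame, as produced by an argmax over the logits.
frames = [1, 1, 0, 2, 2, 2, 0, 3, 3]
print(ctc_greedy_decode(frames, toy_vocab))  # dan
```

The blank between two identical ids is what allows genuinely doubled letters to survive the collapse step: `[2, 0, 2]` decodes to two characters, while `[2, 2]` decodes to one.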

📦 Installation

No installation steps are given for this model. The usage examples import the torch, torchaudio, datasets, and transformers Python packages, so these need to be available in your environment.

💻 Usage Examples

Basic Usage

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
test_dataset = load_dataset("common_voice", "sl", split="test[:2%]")
processor = Wav2Vec2Processor.from_pretrained("mrshu/wav2vec2-large-xlsr-slovene")
model = Wav2Vec2ForCTC.from_pretrained("mrshu/wav2vec2-large-xlsr-slovene")
resampler = torchaudio.transforms.Resample(48_000, 16_000)
# Preprocess the dataset:
# read each audio file into an array and resample it from 48 kHz to 16 kHz.
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch
test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
predicted_ids = torch.argmax(logits, dim=-1)
print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])

🔧 Evaluation

The model can be evaluated as follows on the Slovene test data of Common Voice.

import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import re
test_dataset = load_dataset("common_voice", "sl", split="test")
wer = load_metric("wer")
processor = Wav2Vec2Processor.from_pretrained("mrshu/wav2vec2-large-xlsr-slovene")
model = Wav2Vec2ForCTC.from_pretrained("mrshu/wav2vec2-large-xlsr-slovene")
model.to("cuda")
# Raw string so the backslashes reach the regex engine unchanged.
chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"\“\%\‘\”\�\«\»\)\(\„\'\–\’\—]'
resampler = torchaudio.transforms.Resample(48_000, 16_000)
# Preprocess the dataset:
# strip punctuation, lowercase the reference text, and resample the audio to 16 kHz.
def speech_file_to_array_fn(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch
test_dataset = test_dataset.map(speech_file_to_array_fn)
# Run batched inference on the GPU and greedily decode the predictions.
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch
result = test_dataset.map(evaluate, batched=True, batch_size=8)
print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))

Test Result: 36.97%
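To make the reported figure easier to interpret, here is a small standard-library sketch of what a word error rate measures: the word-level edit distance (substitutions, insertions, deletions) summed over all sentences, divided by the total number of reference words. It is not the implementation behind the `wer` metric used in the evaluation script. The names `normalize`, `edit_distance`, and `word_error_rate` are hypothetical, the punctuation pattern is a simplified variant of the one in the script, and the sentence pair is invented for illustration.

```python
import re

# Simplified variant of the punctuation pattern used in the evaluation script.
CHARS_TO_IGNORE = r'[,?.!\-;:"“%‘”«»()„\'–’—]'


def normalize(sentence):
    """Strip punctuation and lowercase, as the evaluation does for references."""
    return re.sub(CHARS_TO_IGNORE, "", sentence).lower()


def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance computed row by row."""
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def word_error_rate(references, predictions):
    """Total edit distance divided by the total number of reference words."""
    total_edits = 0
    total_words = 0
    for ref, pred in zip(references, predictions):
        ref_words, pred_words = ref.split(), pred.split()
        total_edits += edit_distance(ref_words, pred_words)
        total_words += len(ref_words)
    return total_edits / total_words


# Invented example: one wrong word out of four reference words.
references = [normalize("Dober dan, kako si?")]
predictions = ["dober dan kako se"]
print("WER: {:.2f}".format(100 * word_error_rate(references, predictions)))  # WER: 25.00
```

Read this way, a test WER of 36.97% means that roughly 37 word-level edits are needed per 100 reference words to turn the model's output into the normalized reference transcripts.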

📚 Training

The Common Voice train and validation datasets were used for training. The script used for training can be found here.

📄 License

This model is licensed under the Apache-2.0 license.
