wav2vec2-large-xlsr-53-russian Open-source Russian Speech Recognition Model

Wav2vec2 Large Xlsr 53 Russian

Developed by jonatasgrosman

A Russian speech recognition model fine-tuned from facebook/wav2vec2-large-xlsr-53, supporting 16kHz sampled audio input

Speech Recognition OtherOpen Source License:Apache-2.0 #Russian speech recognition #Low character error rate #XLSR fine-tuning

Downloads 3.9M

Release Time : 3/2/2022

Model Overview

This is an automatic speech recognition (ASR) model optimized for Russian, fine-tuned on the XLSR-53 architecture, demonstrating excellent performance on the Common Voice Russian dataset

Model Features

High-performance Russian recognition

Achieves 13.3% word error rate and 2.88% character error rate on the Common Voice Russian test set

Language model enhancement support

When combined with a language model, the word error rate can be reduced to 9.57% and character error rate to 2.24%

Multi-dataset training

Trained and validated using Common Voice 6.1 and CSS10 datasets

16kHz sampling rate support

Optimized for 16kHz sampled audio input

Model Capabilities

Russian speech-to-text

Long audio processing (supports chunk processing)

Real-time speech recognition

Use Cases

Speech transcription

Russian speech transcription

Convert Russian speech content to text

Achieves 13.3% word error rate on the Common Voice test set

Voice assistants

Russian voice command recognition

Recognize Russian voice commands

🚀 Fine-tuned XLSR-53 large model for speech recognition in Russian

This is a fine-tuned model based on facebook/wav2vec2-large-xlsr-53, which can perform speech recognition in Russian.

🚀 Quick Start

This model is fine-tuned on Russian using the train and validation splits of Common Voice 6.1 and CSS10. When using this model, ensure that your speech input is sampled at 16kHz.

This model has been fine-tuned thanks to the GPU credits generously given by the OVHcloud :)

The script used for training can be found here: https://github.com/jonatasgrosman/wav2vec2-sprint

✨ Features

Language Support: Specifically fine-tuned for Russian speech recognition.
External Support: Utilizes GPU credits from OVHcloud for training.

📦 Installation

No specific installation steps are provided in the original document.

💻 Usage Examples

Basic Usage

Using the HuggingSound library:

from huggingsound import SpeechRecognitionModel

model = SpeechRecognitionModel("jonatasgrosman/wav2vec2-large-xlsr-53-russian")
audio_paths = ["/path/to/file.mp3", "/path/to/another_file.wav"]

transcriptions = model.transcribe(audio_paths)

Advanced Usage

Writing your own inference script:

import torch
import librosa
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "ru"
MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-russian"
SAMPLES = 5

test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = batch["sentence"].upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentences = processor.batch_decode(predicted_ids)

for i, predicted_sentence in enumerate(predicted_sentences):
    print("-" * 100)
    print("Reference:", test_dataset[i]["sentence"])
    print("Prediction:", predicted_sentence)

Results Example

Reference	Prediction
ОН РАБОТАТЬ, А ЕЕ НЕ УДЕРЖАТЬ НИКАК — БЕГАЕТ ЗА КЛЁШЕМ КАЖДОГО БУЛЬВАРНИКА.	ОН РАБОТАТЬ А ЕЕ НЕ УДЕРЖАТ НИКАК БЕГАЕТ ЗА КЛЕШОМ КАЖДОГО БУЛЬБАРНИКА
ЕСЛИ НЕ БУДЕТ ВОЗРАЖЕНИЙ, Я БУДУ СЧИТАТЬ, ЧТО АССАМБЛЕЯ СОГЛАСНА С ЭТИМ ПРЕДЛОЖЕНИЕМ.	ЕСЛИ НЕ БУДЕТ ВОЗРАЖЕНИЙ Я БУДУ СЧИТАТЬ ЧТО АССАМБЛЕЯ СОГЛАСНА С ЭТИМ ПРЕДЛОЖЕНИЕМ
ПАЛЕСТИНЦАМ НЕОБХОДИМО СНАЧАЛА УСТАНОВИТЬ МИР С ИЗРАИЛЕМ, А ЗАТЕМ ДОБИВАТЬСЯ ПРИЗНАНИЯ ГОСУДАРСТВЕННОСТИ.	ПАЛЕСТИНЦАМ НЕОБХОДИМО СНАЧАЛА УСТАНОВИТЬ С НИ МИР ФЕЗРЕЛЕМ А ЗАТЕМ ДОБИВАТЬСЯ ПРИЗНАНИЯ ГОСУДАРСТВЕНСКИ
У МЕНЯ БЫЛО ТАКОЕ ЧУВСТВО, ЧТО ЧТО-ТО ТАКОЕ ОЧЕНЬ ВАЖНОЕ Я ПРИБАВЛЯЮ.	У МЕНЯ БЫЛО ТАКОЕ ЧУВСТВО ЧТО ЧТО-ТО ТАКОЕ ОЧЕНЬ ВАЖНОЕ Я ПРЕДБАВЛЯЕТ
ТОЛЬКО ВРЯД ЛИ ПОЙМЕТ.	ТОЛЬКО ВРЯД ЛИ ПОЙМЕТ
ВРОНСКИЙ, СЛУШАЯ ОДНИМ УХОМ, ПЕРЕВОДИЛ БИНОКЛЬ С БЕНУАРА НА БЕЛЬ-ЭТАЖ И ОГЛЯДЫВАЛ ЛОЖИ.	ЗЛАЗКИ СЛУШАЮ ОТ ОДНИМ УХАМ ТЫ ВОТИ В ВИНОКОТ СПИЛА НА ПЕРЕТАЧ И ОКЛЯДЫВАЛ БОСУ
К СОЖАЛЕНИЮ, СИТУАЦИЯ ПРОДОЛЖАЕТ УХУДШАТЬСЯ.	К СОЖАЛЕНИЮ СИТУАЦИИ ПРОДОЛЖАЕТ УХУЖАТЬСЯ
ВСЁ ЖАЛОВАНИЕ УХОДИЛО НА ДОМАШНИЕ РАСХОДЫ И НА УПЛАТУ МЕЛКИХ НЕПЕРЕВОДИВШИХСЯ ДОЛГОВ.	ВСЕ ЖАЛОВАНИЕ УХОДИЛО НА ДОМАШНИЕ РАСХОДЫ И НА УПЛАТУ МЕЛКИХ НЕ ПЕРЕВОДИВШИХСЯ ДОЛГОВ
ТЕПЕРЬ ДЕЛО, КОНЕЧНО, ЗА ТЕМ, ЧТОБЫ ПРЕВРАТИТЬ СЛОВА В ДЕЛА.	ТЕПЕРЬ ДЕЛАЮ КОНЕЧНО ЗАТЕМ ЧТОБЫ ПРЕВРАТИТЬ СЛОВА В ДЕЛА
ДЕВЯТЬ	ЛЕВЕТЬ

📚 Documentation

Evaluation

To evaluate on mozilla-foundation/common_voice_6_0 with split test

python eval.py --model_id jonatasgrosman/wav2vec2-large-xlsr-53-russian --dataset mozilla-foundation/common_voice_6_0 --config ru --split test

To evaluate on speech-recognition-community-v2/dev_data

python eval.py --model_id jonatasgrosman/wav2vec2-large-xlsr-53-russian --dataset speech-recognition-community-v2/dev_data --config ru --split validation --chunk_length_s 5.0 --stride_length_s 1.0

📄 License

This project is licensed under the Apache-2.0 license.

📖 Citation

If you want to cite this model you can use this:

@misc{grosman2021xlsr53-large-russian,
  title={Fine-tuned {XLSR}-53 large model for speech recognition in {R}ussian},
  author={Grosman, Jonatas},
  howpublished={\url{https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-russian}},
  year={2021}
}

📋 Model Information

Property	Details
Model Type	Fine-tuned XLSR-53 large model for Russian speech recognition
Training Data	Train and validation splits of Common Voice 6.1 and CSS10
Metrics	WER, CER
Tags	audio, automatic-speech-recognition, hf-asr-leaderboard, mozilla-foundation/common_voice_6_0, robust-speech-event, ru, speech, xlsr-fine-tuning-week

Featured Recommended AI Models

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご