# Wav2Vec2-Large-Ru-Golos
The Wav2Vec2-Large-Ru-Golos model is based on facebook/wav2vec2-large-xlsr-53 and has been fine-tuned for Russian on the SberDevices Golos dataset with audio augmentations such as pitch shifting, sound acceleration/deceleration, and reverberation. It is intended for automatic speech recognition (ASR) of Russian speech.
## Quick Start

When using this model, make sure that your speech input is sampled at 16 kHz.
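The original README does not show how to prepare local recordings, so the following is a minimal sketch, assuming torchaudio is installed, of loading an arbitrary file (the path `my_recording.wav` is a placeholder) and resampling it to 16 kHz:

```python
# Illustrative sketch (not from the original README): load a local recording,
# downmix it to mono and resample it to the 16 kHz rate the model expects.
import torchaudio

waveform, sample_rate = torchaudio.load("my_recording.wav")  # placeholder path
waveform = waveform.mean(dim=0, keepdim=True)                # downmix to mono
if sample_rate != 16_000:
    resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16_000)
    waveform = resampler(waveform)
speech_array = waveform.squeeze().numpy()  # 1-D array, ready for Wav2Vec2Processor
```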
## Features

- Based on facebook/wav2vec2-large-xlsr-53 and fine-tuned for Russian.
- Uses audio augmentations such as pitch shift, sound acceleration/deceleration, and reverberation during fine-tuning (an illustrative sketch follows this list).
- Suitable for automatic speech recognition tasks in Russian.
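The exact augmentation pipeline and its parameters are not published in this README, so the snippet below is only an illustrative sketch of pitch shifting and speed perturbation with librosa; the parameter ranges are invented and reverberation is left out:

```python
# Illustrative sketch only: the actual augmentation settings used during
# fine-tuning are not described in this README.
import numpy as np
import librosa

def augment(speech: np.ndarray, sr: int = 16_000) -> np.ndarray:
    # Random pitch shift of up to +/- 2 semitones (invented range).
    speech = librosa.effects.pitch_shift(speech, sr=sr, n_steps=np.random.uniform(-2.0, 2.0))
    # Random acceleration/deceleration of up to +/- 10 % (invented range).
    speech = librosa.effects.time_stretch(speech, rate=np.random.uniform(0.9, 1.1))
    # Reverberation (e.g. convolving with a room impulse response) would be applied
    # here as well; it is omitted from this sketch.
    return speech
```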
## Installation

The original README provides no specific installation steps. The usage examples below rely on the Hugging Face `transformers` and `datasets` libraries, PyTorch, and `jiwer` for the evaluation metrics.
## Usage Examples

### Basic Usage
```python
import torch
from datasets import load_dataset
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# Load the processor (feature extractor + tokenizer) and the fine-tuned model.
processor = Wav2Vec2Processor.from_pretrained("bond005/wav2vec2-large-ru-golos")
model = Wav2Vec2ForCTC.from_pretrained("bond005/wav2vec2-large-ru-golos")

# Load the test split of the Golos "crowd" domain.
ds = load_dataset("bond005/sberdevices_golos_10h_crowd", split="test")

# Convert the raw waveform into model inputs.
processed = processor(ds[0]["audio"]["array"], sampling_rate=16_000,
                      return_tensors="pt", padding="longest")

# Run the model and greedily decode the CTC output.
logits = model(processed.input_values, attention_mask=processed.attention_mask).logits
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]
print(transcription)
```
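For quick experiments, the same model can also be wrapped in the high-level `pipeline` API of transformers; this compact alternative is not part of the original README:

```python
# Compact alternative via the transformers pipeline API (not from the original README).
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="bond005/wav2vec2-large-ru-golos")
result = asr(ds[0]["audio"]["array"])  # reuses the dataset loaded above
print(result["text"])
```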
### Advanced Usage

The following snippet shows how to evaluate bond005/wav2vec2-large-ru-golos on the "crowd" and "farfield" test splits of the Golos dataset.
```python
import torch
from datasets import load_dataset
from jiwer import wer, cer
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load the two Golos test splits and drop samples with empty reference transcriptions.
golos_crowd_test = load_dataset("bond005/sberdevices_golos_10h_crowd", split="test")
golos_crowd_test = golos_crowd_test.filter(
    lambda it1: (it1["transcription"] is not None) and (len(it1["transcription"].strip()) > 0)
)
golos_farfield_test = load_dataset("bond005/sberdevices_golos_100h_farfield", split="test")
golos_farfield_test = golos_farfield_test.filter(
    lambda it2: (it2["transcription"] is not None) and (len(it2["transcription"].strip()) > 0)
)

# Load the fine-tuned Russian model and its processor, and move the model to the GPU.
model = Wav2Vec2ForCTC.from_pretrained("bond005/wav2vec2-large-ru-golos").to("cuda")
processor = Wav2Vec2Processor.from_pretrained("bond005/wav2vec2-large-ru-golos")

def map_to_pred(batch):
    # Convert one utterance into model inputs and move the tensors to the GPU.
    processed = processor(
        batch["audio"]["array"], sampling_rate=batch["audio"]["sampling_rate"],
        return_tensors="pt", padding="longest"
    )
    input_values = processed.input_values.to("cuda")
    attention_mask = processed.attention_mask.to("cuda")

    # Greedy CTC decoding of the model output.
    with torch.no_grad():
        logits = model(input_values, attention_mask=attention_mask).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)
    batch["text"] = transcription[0]
    return batch

# Transcribe both test splits and compare the predictions with the references.
crowd_result = golos_crowd_test.map(map_to_pred, remove_columns=["audio"])
crowd_wer = wer(crowd_result["transcription"], crowd_result["text"])
crowd_cer = cer(crowd_result["transcription"], crowd_result["text"])
print("Word error rate on the Crowd domain:", crowd_wer)
print("Character error rate on the Crowd domain:", crowd_cer)

farfield_result = golos_farfield_test.map(map_to_pred, remove_columns=["audio"])
farfield_wer = wer(farfield_result["transcription"], farfield_result["text"])
farfield_cer = cer(farfield_result["transcription"], farfield_result["text"])
print("Word error rate on the Farfield domain:", farfield_wer)
print("Character error rate on the Farfield domain:", farfield_cer)
```
## Documentation

### Evaluation Results
Result (WER, %):

| "crowd" | "farfield" |
|---------|------------|
| 10.144  | 20.353     |

Result (CER, %):

| "crowd" | "farfield" |
|---------|------------|
| 2.168   | 6.030      |
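As a reminder of how these metrics are computed, here is a tiny illustrative example with jiwer; the two strings are made up and not taken from Golos:

```python
from jiwer import wer, cer

reference = "добрый день"   # made-up reference transcription
hypothesis = "добрый ден"   # made-up hypothesis with a single error

print(wer(reference, hypothesis))  # 0.5: one of the two words is wrong
print(cer(reference, hypothesis))  # character-level error rate for the same pair
```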
The evaluation script for other datasets, including Russian Librispeech and SOVA RuDevices, is available on my Kaggle web page: https://www.kaggle.com/code/bond005/wav2vec2-ru-eval
Model Information
Property |
Details |
Model Type |
Based on facebook/wav2vec2-large-xlsr-53, fine - tuned for Russian |
Training Data |
Sberdevices Golos, bond005/sova_rudevices, bond005/rulibrispeech |
Metrics |
WER, CER |
Tags |
audio, automatic - speech - recognition, speech, xlsr - fine - tuning - week |
## License

This model is licensed under the Apache-2.0 license.
## Citation

If you want to cite this model, you can use the following BibTeX entry:
```bibtex
@misc{bondarenko2022wav2vec2-large-ru-golos,
  title={XLSR Wav2Vec2 Russian by Ivan Bondarenko},
  author={Bondarenko, Ivan},
  publisher={Hugging Face},
  journal={Hugging Face Hub},
  howpublished={\url{https://huggingface.co/bond005/wav2vec2-large-ru-golos}},
  year={2022}
}
```