# Wav2Vec2-Large-Ru-Golos
The Wav2Vec2-Large-Ru-Golos model is based on facebook/wav2vec2-large-xlsr-53 and has been fine-tuned for Russian on the SberDevices Golos dataset with audio augmentations such as pitch shifting, sound acceleration/deceleration, and reverberation. It is intended for automatic speech recognition (ASR) of Russian speech.
## Quick Start

When using this model, make sure that your speech input is sampled at 16 kHz.
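The original README does not show how to prepare local recordings, so the following is a minimal sketch, assuming torchaudio is installed, of loading an arbitrary file (the path `my_recording.wav` is a placeholder) and resampling it to 16 kHz:

```python
# Illustrative sketch (not from the original README): load a local recording,
# downmix it to mono and resample it to the 16 kHz rate the model expects.
import torchaudio

waveform, sample_rate = torchaudio.load("my_recording.wav")  # placeholder path
waveform = waveform.mean(dim=0, keepdim=True)                # downmix to mono
if sample_rate != 16_000:
    resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16_000)
    waveform = resampler(waveform)
speech_array = waveform.squeeze().numpy()  # 1-D array, ready for Wav2Vec2Processor
```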
## Features

- Based on facebook/wav2vec2-large-xlsr-53 and fine-tuned for Russian.
- Uses audio augmentations such as pitch shift, sound acceleration/deceleration, and reverberation during fine-tuning (an illustrative sketch follows this list).
- Suitable for automatic speech recognition tasks in Russian.
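The exact augmentation pipeline and its parameters are not published in this README, so the snippet below is only an illustrative sketch of pitch shifting and speed perturbation with librosa; the parameter ranges are invented and reverberation is left out:

```python
# Illustrative sketch only: the actual augmentation settings used during
# fine-tuning are not described in this README.
import numpy as np
import librosa

def augment(speech: np.ndarray, sr: int = 16_000) -> np.ndarray:
    # Random pitch shift of up to +/- 2 semitones (invented range).
    speech = librosa.effects.pitch_shift(speech, sr=sr, n_steps=np.random.uniform(-2.0, 2.0))
    # Random acceleration/deceleration of up to +/- 10 % (invented range).
    speech = librosa.effects.time_stretch(speech, rate=np.random.uniform(0.9, 1.1))
    # Reverberation (e.g. convolving with a room impulse response) would be applied
    # here as well; it is omitted from this sketch.
    return speech
```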
## Installation

The original README provides no specific installation steps. The usage examples below rely on the Hugging Face `transformers` and `datasets` libraries, PyTorch, and `jiwer` for the evaluation metrics.
## Usage Examples

### Basic Usage
```python
import torch
from datasets import load_dataset
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# Load the processor (feature extractor + tokenizer) and the fine-tuned model.
processor = Wav2Vec2Processor.from_pretrained("bond005/wav2vec2-large-ru-golos")
model = Wav2Vec2ForCTC.from_pretrained("bond005/wav2vec2-large-ru-golos")

# Load the test split of the Golos "crowd" domain.
ds = load_dataset("bond005/sberdevices_golos_10h_crowd", split="test")

# Convert the raw waveform into model inputs.
processed = processor(ds[0]["audio"]["array"], sampling_rate=16_000,
                      return_tensors="pt", padding="longest")

# Run the model and greedily decode the CTC output.
logits = model(processed.input_values, attention_mask=processed.attention_mask).logits
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]
print(transcription)
```
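For quick experiments, the same model can also be wrapped in the high-level `pipeline` API of transformers; this compact alternative is not part of the original README:

```python
# Compact alternative via the transformers pipeline API (not from the original README).
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="bond005/wav2vec2-large-ru-golos")
result = asr(ds[0]["audio"]["array"])  # reuses the dataset loaded above
print(result["text"])
```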
### Advanced Usage

The following snippet shows how to evaluate bond005/wav2vec2-large-ru-golos on the "crowd" and "farfield" test splits of the Golos dataset.
```python
import torch
from datasets import load_dataset
from jiwer import wer, cer
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load the two Golos test splits and drop samples with empty reference transcriptions.
golos_crowd_test = load_dataset("bond005/sberdevices_golos_10h_crowd", split="test")
golos_crowd_test = golos_crowd_test.filter(
    lambda it1: (it1["transcription"] is not None) and (len(it1["transcription"].strip()) > 0)
)
golos_farfield_test = load_dataset("bond005/sberdevices_golos_100h_farfield", split="test")
golos_farfield_test = golos_farfield_test.filter(
    lambda it2: (it2["transcription"] is not None) and (len(it2["transcription"].strip()) > 0)
)

# Load the fine-tuned Russian model and its processor, and move the model to the GPU.
model = Wav2Vec2ForCTC.from_pretrained("bond005/wav2vec2-large-ru-golos").to("cuda")
processor = Wav2Vec2Processor.from_pretrained("bond005/wav2vec2-large-ru-golos")

def map_to_pred(batch):
    # Convert one utterance into model inputs and move the tensors to the GPU.
    processed = processor(
        batch["audio"]["array"], sampling_rate=batch["audio"]["sampling_rate"],
        return_tensors="pt", padding="longest"
    )
    input_values = processed.input_values.to("cuda")
    attention_mask = processed.attention_mask.to("cuda")

    # Greedy CTC decoding of the model output.
    with torch.no_grad():
        logits = model(input_values, attention_mask=attention_mask).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)
    batch["text"] = transcription[0]
    return batch

# Transcribe both test splits and compare the predictions with the references.
crowd_result = golos_crowd_test.map(map_to_pred, remove_columns=["audio"])
crowd_wer = wer(crowd_result["transcription"], crowd_result["text"])
crowd_cer = cer(crowd_result["transcription"], crowd_result["text"])
print("Word error rate on the Crowd domain:", crowd_wer)
print("Character error rate on the Crowd domain:", crowd_cer)

farfield_result = golos_farfield_test.map(map_to_pred, remove_columns=["audio"])
farfield_wer = wer(farfield_result["transcription"], farfield_result["text"])
farfield_cer = cer(farfield_result["transcription"], farfield_result["text"])
print("Word error rate on the Farfield domain:", farfield_wer)
print("Character error rate on the Farfield domain:", farfield_cer)
```
## Documentation

### Evaluation Results
Result (WER, %):

| "crowd" | "farfield" |
|---------|------------|
| 10.144  | 20.353     |

Result (CER, %):

| "crowd" | "farfield" |
|---------|------------|
| 2.168   | 6.030      |
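As a reminder of how these metrics are computed, here is a tiny illustrative example with jiwer; the two strings are made up and not taken from Golos:

```python
from jiwer import wer, cer

reference = "добрый день"   # made-up reference transcription
hypothesis = "добрый ден"   # made-up hypothesis with a single error

print(wer(reference, hypothesis))  # 0.5: one of the two words is wrong
print(cer(reference, hypothesis))  # character-level error rate for the same pair
```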
The evaluation script for other datasets, including Russian Librispeech and SOVA RuDevices, is available on my Kaggle web page: https://www.kaggle.com/code/bond005/wav2vec2-ru-eval
Model Information
Property |
Details |
Model Type |
Based on facebook/wav2vec2-large-xlsr-53, fine - tuned for Russian |
Training Data |
Sberdevices Golos, bond005/sova_rudevices, bond005/rulibrispeech |
Metrics |
WER, CER |
Tags |
audio, automatic - speech - recognition, speech, xlsr - fine - tuning - week |
## License

This model is licensed under the Apache-2.0 license.
## Citation

If you want to cite this model, you can use the following BibTeX entry:
```bibtex
@misc{bondarenko2022wav2vec2-large-ru-golos,
  title={XLSR Wav2Vec2 Russian by Ivan Bondarenko},
  author={Bondarenko, Ivan},
  publisher={Hugging Face},
  journal={Hugging Face Hub},
  howpublished={\url{https://huggingface.co/bond005/wav2vec2-large-ru-golos}},
  year={2022}
}
```