Whisper Large V3 Russian Finetuned Model
This is a finetuned version of the Whisper Large V3 model, optimized for better support of the Russian language in Automatic Speech Recognition (ASR) tasks.
Quick Start
This model is a finetuned version of openai/whisper-large-v3 designed to better support the Russian language.
Features
- Finetuned for Russian: Specifically optimized for the Russian language, achieving better performance on Russian speech recognition tasks.
- Trained on a Large Dataset: Finetuned on the Russian part of Common Voice 17.0, which contains over 200k rows.
- Lower WER: On the held-out test split, the Word Error Rate (WER) drops from 9.84 for the original Whisper Large V3 to 6.39 after finetuning.
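The card does not say which tool produced the WER figures, so the following is only a minimal sketch of the standard definition: the word-level edit distance between reference and hypothesis, divided by the number of reference words. The function name `word_error_rate` is hypothetical, and no text normalization (casing, punctuation) is applied here, although a real evaluation usually needs it.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
print(word_error_rate("я иду домой сегодня", "я иду домой завтра"))  # 0.25
```

Multiplying the result by 100 gives a percentage, which is presumably the scale of the 9.84 and 6.39 figures, although the card does not state the unit.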
Usage Examples
Basic Usage
To process phone calls, it is highly recommended that you preprocess your recordings and normalize their volume before performing ASR. For example, like this:
sox record.wav -r 16k record-normalized.wav norm -0.5 compand 0.3,1 -90,-90,-70,-70,-60,-20,0,0 -5 0 0.2
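If the normalization has to run from Python rather than from a shell, the same sox invocation can be wrapped with `subprocess`. This is a sketch only: the helper names `build_sox_command` and `normalize_recording` are hypothetical, and sox is assumed to be installed and on `PATH`.

```python
import subprocess


def build_sox_command(src: str, dst: str) -> list[str]:
    """Return the argument list for the sox call shown above."""
    return [
        "sox", src,
        "-r", "16k", dst,          # resample the output to 16 kHz
        "norm", "-0.5",            # normalize peak level to -0.5 dB
        "compand", "0.3,1", "-90,-90,-70,-70,-60,-20,0,0", "-5", "0", "0.2",
    ]


def normalize_recording(src: str, dst: str) -> None:
    """Run sox; raises subprocess.CalledProcessError on a non-zero exit status."""
    subprocess.run(build_sox_command(src, dst), check=True)


# Print the command instead of running it, so the sketch works without sox installed.
print(" ".join(build_sox_command("record.wav", "record-normalized.wav")))
```

Calling `normalize_recording("record.wav", "record-normalized.wav")` would then produce the normalized file used in the rest of this section.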
Then your ASR code should look somewhat like this:
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor, pipeline

torch_dtype = torch.bfloat16  # set your preferred dtype here

# Pick the best available device: CUDA GPU, then Apple MPS, then CPU.
device = 'cpu'
if torch.cuda.is_available():
    device = 'cuda'
elif torch.backends.mps.is_available():
    device = 'mps'
    # Monkey patch: make torch.distributed report itself as uninitialized.
    setattr(torch.distributed, "is_initialized", lambda: False)
device = torch.device(device)

# Load the finetuned model and its processor (tokenizer + feature extractor).
whisper = WhisperForConditionalGeneration.from_pretrained(
    "antony66/whisper-large-v3-russian",
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
)
processor = WhisperProcessor.from_pretrained("antony66/whisper-large-v3-russian")

# Build the ASR pipeline; long audio is processed in 30-second chunks.
asr_pipeline = pipeline(
    "automatic-speech-recognition",
    model=whisper,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=256,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,
    torch_dtype=torch_dtype,
    device=device,
)

# Read the normalized recording as raw bytes. The pipeline accepts a file path,
# bytes, or a NumPy array (not a BytesIO object) and decodes bytes with ffmpeg.
with open('record-normalized.wav', 'rb') as f:
    wav = f.read()

# Transcribe in Russian and print the recognized text.
asr = asr_pipeline(wav, generate_kwargs={"language": "russian", "max_new_tokens": 256}, return_timestamps=False)
print(asr['text'])
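The call above passes `return_timestamps=False`, so only the plain text is returned. When the transformers pipeline is called with `return_timestamps=True` instead, its result also carries a `chunks` list whose items hold a `timestamp` tuple `(start, end)` in seconds and the chunk `text`. A minimal sketch of turning that structure into readable lines follows; the helper `format_chunks` is hypothetical, and the sample dictionary is hand-written to mimic the shape of a result, not real model output.

```python
def format_chunks(result: dict) -> list[str]:
    """Render each timestamped chunk as '[start - end] text'."""

    def fmt(seconds):
        # A boundary can be None when the model emits no timestamp for it.
        return f"{seconds:.2f}" if seconds is not None else "?"

    lines = []
    for chunk in result.get("chunks", []):
        start, end = chunk["timestamp"]
        lines.append(f"[{fmt(start)} - {fmt(end)}] {chunk['text'].strip()}")
    return lines


# Hand-written stand-in with the same shape as a pipeline result.
example = {
    "text": " Добрый день. Чем могу помочь?",
    "chunks": [
        {"timestamp": (0.0, 1.8), "text": " Добрый день."},
        {"timestamp": (1.8, None), "text": " Чем могу помочь?"},
    ],
}

for line in format_chunks(example):
    print(line)
```

For phone calls, per-chunk timestamps like these make it easier to align the transcript with the recording when reviewing recognition errors.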
Documentation
Model Details
This is a version of openai/whisper-large-v3 finetuned for better support of the Russian language.
The dataset used for finetuning is the Russian part of Common Voice 17.0, which contains over 200k rows.
The original dataset was preprocessed by mixing all splits together and splitting the result into a new train/test split at a 0.95/0.05 ratio (225761 and 11883 rows respectively).
On this test split, the original Whisper Large V3 has a WER of 9.84, while the finetuned version reaches 6.39 (so far).
The finetuning process took over 60 hours on two NVIDIA A100 80 GB GPUs.
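The row counts are consistent with a 0.95/0.05 split of 225761 + 11883 = 237644 rows. The sketch below illustrates that kind of mix-and-split on row indices; the function name `mix_and_split`, the seed, and the use of Python's `random` module are assumptions, since the card does not say which tool or seed was actually used.

```python
import random


def mix_and_split(rows: list, train_percent: int = 95, seed: int = 0):
    """Shuffle a copy of the rows, then cut it into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding at the cut point.
    n_train = len(shuffled) * train_percent // 100
    return shuffled[:n_train], shuffled[n_train:]


# Row indices stand in for the merged Common Voice splits.
train, test = mix_and_split(list(range(237_644)))
print(len(train), len(test))  # 225761 11883
```

Because all original splits are mixed before cutting, the resulting test set differs from the official Common Voice test split, so the WER figures here are not directly comparable with results reported on that official split.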
Technical Details
- Model Type: Finetuned version of openai/whisper-large-v3
- Training Data: Common Voice 17.0, Russian part, containing over 200k rows
- Training Time: Over 60 hours on two NVIDIA A100 80 GB GPUs
- Metrics: Word Error Rate (WER)
| Property | Details |
|---|---|
| Model Type | Finetuned version of openai/whisper-large-v3 |
| Training Data | Common Voice 17.0, Russian part, over 200k rows |
| Training Time | Over 60 hours on two NVIDIA A100 80 GB GPUs |
| Metrics | Word Error Rate (WER) |
Work in progress
This model is still a work in progress. The goal is to finetune it for speech recognition of phone calls as much as possible. If you want to contribute and know of or have a suitable dataset, please let me know. Your help will be much appreciated.