Whisper Russian Model
This is a merged Whisper model for Russian speech recognition, offering high-quality ASR capabilities.
Quick Start
This README provides details about a merged Whisper model for Russian speech recognition, including its base models, training datasets, merge method, and usage examples.
Features
- Multi-base model merge: Merged from antony66/whisper-large-v3-russian and bond005/whisper-large-v3-ru-podlodka.
- Diverse training data: Trained on datasets like mozilla-foundation/common_voice_17_0, bond005/taiga_speech_v2, etc.
- TIES merge method: Utilizes the TIES method for model merging.
Installation
The model is used through the transformers library, which you can install with the following command:
pip install transformers
The usage example in this README also imports torch, so PyTorch must be installed as well.
Documentation
Model Information
| Property | Details |
|---|---|
| Base Models | antony66/whisper-large-v3-russian, bond005/whisper-large-v3-ru-podlodka |
| Language | Russian |
| Library Name | transformers |
| Tags | asr, whisper, russian, mergekit, merge |
| Datasets | mozilla-foundation/common_voice_17_0, bond005/taiga_speech_v2, bond005/podlodka_speech, bond005/rulibrispeech |
| Metrics | wer |
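The table lists WER (word error rate) as the evaluation metric. As a rough illustration of what that number means, here is a minimal sketch that computes WER as the word-level edit distance divided by the number of reference words. The function name `word_error_rate` and the lowercase-and-split normalization are assumptions made for this example; the normalization actually used to evaluate this model is not stated here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference words."""
    # Assumed normalization for this sketch: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)
```

A lower value is better; 0.0 means the transcript matches the reference word for word.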
Model Details
This model was merged using the TIES merge method.
method: ties
parameters:
  ties_density: 0.9
  encoder_weights:
    - 0.8
    - 0.2
  decoder_weights:
    - 0.2
    - 0.8
models:
  model_a: "/mnt/cloud/llm/whisper/whisper-large-v3-russian"
  model_b: "/mnt/cloud/llm/whisper/whisper-large-v3-ru-podlodka"
output_dir: "/mnt/cloud/llm/whisper/whisper-large-v3-russian-ties-podlodka"
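To give an idea of what a TIES merge does with a density and per-model weights like those in the configuration above, here is a minimal NumPy sketch for a single tensor. It is not the script used to build this model, which is not shown in this README. It follows the general TIES recipe (trim small changes, elect a sign per parameter, average only the agreeing changes) in one common weighted variant. The names `trim` and `ties_merge` are hypothetical, and `base` is assumed to be the corresponding tensor of a common ancestor checkpoint from which both models were fine-tuned.

```python
import numpy as np


def trim(task_vector, density):
    """Zero all but the `density` fraction of largest-magnitude entries."""
    magnitudes = np.abs(task_vector).ravel()
    keep = int(round(density * magnitudes.size))
    if keep <= 0:
        return np.zeros_like(task_vector)
    if keep >= magnitudes.size:
        return task_vector.copy()
    # Smallest magnitude that still belongs to the top `keep` entries
    # (exact ties at the threshold are all kept).
    cutoff = magnitudes.size - keep
    threshold = np.partition(magnitudes, cutoff)[cutoff]
    return np.where(np.abs(task_vector) >= threshold, task_vector, 0.0)


def ties_merge(base, finetuned, weights, density):
    """Merge fine-tuned tensors into one tensor (weighted TIES sketch)."""
    # 1. Trim each task vector (fine-tuned minus base), then apply its weight.
    deltas = np.stack(
        [trim(t - base, density) * w for t, w in zip(finetuned, weights)]
    )
    # 2. Elect a sign per parameter from the summed weighted deltas.
    elected = np.sign(deltas.sum(axis=0))
    # 3. Average only the non-zero deltas that agree with the elected sign.
    agrees = (np.sign(deltas) == elected) & (deltas != 0)
    weight_column = np.asarray(weights, dtype=float).reshape(
        -1, *([1] * base.ndim)
    )
    total_weight = (agrees * weight_column).sum(axis=0)
    safe_total = np.where(total_weight > 0, total_weight, 1.0)
    merged = np.where(
        total_weight > 0, (deltas * agrees).sum(axis=0) / safe_total, 0.0
    )
    return base + merged
```

Under this reading of the configuration, encoder tensors would be merged with `weights=[0.8, 0.2]` and decoder tensors with `weights=[0.2, 0.8]`, both with `density=0.9`, with the weights presumably listed in the order model_a, model_b.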
Simple API server
The model can be served through a simple OpenAI-compatible API server: https://github.com/kreolsky/whisper-api-server/
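As a rough sketch of how a client might talk to such a server: OpenAI-compatible speech servers usually expose the `/v1/audio/transcriptions` route and accept a multipart upload with `file` and `model` fields. Everything specific here is an assumption rather than something documented in this README: the base URL and port, the `model` value, the `language` field, and the `text` key in the JSON response. Check the server's own documentation before relying on it. The example needs the third-party requests package.

```python
def transcription_url(base_url):
    """Build the transcription endpoint from a server base URL."""
    # Assumption: the server follows the standard OpenAI route layout.
    return base_url.rstrip("/") + "/v1/audio/transcriptions"


def transcribe(path, base_url, model="whisper-1", language="ru"):
    """Upload an audio file and return the recognized text."""
    # Imported lazily so transcription_url() works without requests installed.
    import requests

    with open(path, "rb") as audio:
        response = requests.post(
            transcription_url(base_url),
            files={"file": audio},
            # Hypothetical values; the server may ignore or require others.
            data={"model": model, "language": language},
            timeout=300,
        )
    response.raise_for_status()
    # Assumption: the response body is JSON with a "text" field.
    return response.json()["text"]
```

For example, `transcribe("record-normalized.wav", "http://localhost:8000")` would send the normalized recording to a server running locally, where the host and port are placeholders.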
Usage Examples
Basic Usage
To process phone calls, it is highly recommended that you preprocess your recordings and normalize their volume before running ASR. For example:
sox record.wav -r 8000 record-normalized.wav norm -0.5 compand 0.3,1 -90,-90,-70,-50,-40,-15,0,0 -7 0 0.15
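If you have many recordings, the same command can be driven from Python. The sketch below only wraps the sox invocation shown above; the helper names `build_sox_command` and `normalize_recording` are made up for this example, and it assumes the sox binary is available on your PATH.

```python
import subprocess


def build_sox_command(src, dst, rate=8000):
    """Return the sox argument list: resample, normalize, then compand."""
    return [
        "sox", src, "-r", str(rate), dst,
        "norm", "-0.5",
        "compand", "0.3,1", "-90,-90,-70,-50,-40,-15,0,0", "-7", "0", "0.15",
    ]


def normalize_recording(src, dst):
    """Run sox on one file and raise CalledProcessError if it fails."""
    subprocess.run(build_sox_command(src, dst), check=True)
```

Calling `normalize_recording("record.wav", "record-normalized.wav")` is equivalent to the command line above.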
Your ASR code can then look something like this:
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor, pipeline

torch_dtype = torch.bfloat16

# Pick the best available device: CUDA GPU, then Apple MPS, then CPU.
device = 'cpu'
if torch.cuda.is_available():
    device = 'cuda'
elif torch.backends.mps.is_available():
    device = 'mps'
    # Monkey patch: make torch.distributed report that it is not initialized.
    setattr(torch.distributed, "is_initialized", lambda: False)
device = torch.device(device)

# Load the model weights and the matching processor (tokenizer + feature extractor).
whisper = WhisperForConditionalGeneration.from_pretrained(
    "antony66/whisper-large-v3-russian",
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
)
processor = WhisperProcessor.from_pretrained("antony66/whisper-large-v3-russian")

# Long recordings are split into 30-second chunks and decoded in batches.
asr_pipeline = pipeline(
    "automatic-speech-recognition",
    model=whisper,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=256,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,
    torch_dtype=torch_dtype,
    device=device,
)

# Read the normalized recording as raw bytes; the pipeline decodes
# bytes input with ffmpeg, so ffmpeg must be installed.
with open('record-normalized.wav', 'rb') as f:
    wav = f.read()

asr = asr_pipeline(
    wav,
    generate_kwargs={"language": "russian", "max_new_tokens": 256},
    return_timestamps=False,
)
print(asr['text'])
Technical Details
This model is currently a work in progress. The goal is to fine-tune it as far as possible for speech recognition of phone calls. If you would like to contribute, or you know of or have a good dataset, please let me know. Your help will be much appreciated.