Whisper-large-v3-russian Open-source Russian Speech Recognition Model - Free Deployment for Accurate Russian Recognition

Whisper Large V3 Russian

Developed by antony66

A Russian speech recognition model fine-tuned from OpenAI Whisper-large-v3 to improve recognition of Russian speech.

Speech Recognition

Transformers

Other #Russian speech recognition #Telephone recording optimization #Low word error rate

Release Time: 5/17/2024

Model Overview

This model is a Russian-optimized version of Whisper-large-v3, fine-tuned on the Russian part of Common Voice 17.0. On a held-out test split of that dataset, its WER is 6.39, compared with 9.84 for the original model.

Model Features

Russian Optimization

Fine-tuned specifically on Russian speech to improve recognition accuracy over the base Whisper-large-v3 model.

High Performance

On a held-out test split of the Common Voice 17.0 Russian dataset, WER decreased from 9.84 (original Whisper-large-v3) to 6.39 (this model).

Telephone Recording Optimization

Intended for telephone call scenarios; tuning for phone calls is still a work in progress. Preprocessing recordings (volume normalization) before recognition is recommended for best results.

Model Capabilities

Russian speech recognition

Automatic speech-to-text

Optional timestamps in the output

Use Cases

Speech Transcription

Telephone Recording Transcription

Automatically transcribe Russian telephone conversations into text

WER 6.39 (measured on the Common Voice 17.0 Russian test split, not on telephone recordings)

Speech Content Analysis

Transcribe Russian speech into text so that its content can be analyzed and processed downstream

🚀 Whisper Large V3 Russian Finetuned Model

This is a finetuned version of the Whisper Large V3 model, optimized for better support of the Russian language in Automatic Speech Recognition (ASR) tasks.

✨ Features

Finetuned for Russian: Optimized specifically for Russian, with better performance on Russian speech recognition tasks than the base model.
Trained on a Large Dataset: Finetuned on the Russian part of Common Voice 17.0, which contains over 200k rows.
Lower WER: After finetuning, the Word Error Rate (WER) on the held-out test split dropped from 9.84 to 6.39.
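
For readers unfamiliar with the metric, here is a minimal sketch of how WER is commonly computed (word-level edit distance divided by the number of reference words) and what the reported drop means in relative terms. The function names are illustrative, the author's actual evaluation script is not shown in this document, and the sketch assumes the reported values 9.84 and 6.39 are percentages.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                          # deletion
                cur[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word)  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Fraction by which the error rate dropped."""
    return (before - after) / before


# Reported values from this model card (assumed to be percentages).
baseline_wer, finetuned_wer = 9.84, 6.39
print(f"relative WER reduction: {relative_reduction(baseline_wer, finetuned_wer):.1%}")
```

Real evaluations usually normalize text (case, punctuation) before comparing; that step is omitted here.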

💻 Usage Examples

Basic Usage

To process phone calls, it is highly recommended that you preprocess your recordings and adjust their volume before performing ASR. For example:

sox record.wav -r 16k record-normalized.wav norm -0.5 compand 0.3,1 -90,-90,-70,-70,-60,-20,0,0 -5 0 0.2
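
If the preprocessing has to run from Python rather than a shell, the same command can be assembled as an argument list. This is only a sketch: the helper names `build_sox_command` and `normalize_recording` are made up for illustration, and it assumes the `sox` executable is installed and on `PATH`. The comments describe each argument according to the SoX manual.

```python
import shutil
import subprocess


def build_sox_command(src: str, dst: str) -> list[str]:
    """Assemble the same sox invocation as the shell command above."""
    return [
        "sox", src,
        "-r", "16k", dst,                # write the output at 16 kHz, the rate Whisper expects
        "norm", "-0.5",                  # normalize the peak level to -0.5 dB
        "compand", "0.3,1",              # compander with 0.3 s attack and 1 s decay
        "-90,-90,-70,-70,-60,-20,0,0",   # transfer function as input,output dB pairs
        "-5", "0", "0.2",                # gain -5 dB, initial volume 0 dB, delay 0.2 s
    ]


def normalize_recording(src: str, dst: str) -> None:
    """Run sox on one file; raises if sox is missing or exits with an error."""
    if shutil.which("sox") is None:
        raise RuntimeError("sox executable not found on PATH")
    subprocess.run(build_sox_command(src, dst), check=True)


# Printing the list reproduces the shell command, which makes it easy to verify.
print(" ".join(build_sox_command("record.wav", "record-normalized.wav")))
```

Passing the arguments as a list avoids shell quoting issues when file names contain spaces.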

Then your ASR code should look somewhat like this:

import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor, pipeline

torch_dtype = torch.bfloat16 # set your preferred type here 

device = 'cpu'
if torch.cuda.is_available():
    device = 'cuda'
elif torch.backends.mps.is_available():
    device = 'mps'
    setattr(torch.distributed, "is_initialized", lambda : False) # monkey patching
device = torch.device(device)

whisper = WhisperForConditionalGeneration.from_pretrained(
    "antony66/whisper-large-v3-russian", torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True,
    # add attn_implementation="flash_attention_2" if your GPU supports it
)

processor = WhisperProcessor.from_pretrained("antony66/whisper-large-v3-russian")

asr_pipeline = pipeline(
    "automatic-speech-recognition",
    model=whisper,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=256,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,
    torch_dtype=torch_dtype,
    device=device,
)

# Read your wav file into the variable wav as raw bytes.
# The pipeline accepts a file path, raw bytes, or a NumPy array, but not a BytesIO object.
with open('record-normalized.wav', 'rb') as f:
    wav = f.read()

# Get the transcription. return_timestamps=False here overrides the pipeline default set above.
asr = asr_pipeline(wav, generate_kwargs={"language": "russian", "max_new_tokens": 256}, return_timestamps=False)

print(asr['text'])
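
The call above passes `return_timestamps=False`, so only the `text` field is used. When the Transformers ASR pipeline is called with `return_timestamps=True`, the result also contains a `chunks` list of segments, each with a `(start, end)` timestamp in seconds. The sketch below formats such a result. The helper `format_chunks` is illustrative, and the `example` dictionary is hand-written to show the shape of the output; it is not real model output.

```python
from typing import Optional


def _fmt(seconds: Optional[float]) -> str:
    # The end of the last chunk can be None when the model
    # did not emit a closing timestamp token.
    return f"{seconds:.2f}" if seconds is not None else "?"


def format_chunks(result: dict) -> list[str]:
    """Turn a pipeline result with timestamps into printable lines."""
    lines = []
    for chunk in result.get("chunks", []):
        start, end = chunk["timestamp"]
        lines.append(f"[{_fmt(start)} - {_fmt(end)}] {chunk['text'].strip()}")
    return lines


# Hand-written example in the shape the pipeline returns (not real output).
example = {
    "text": " Алло. Добрый день.",
    "chunks": [
        {"timestamp": (0.0, 1.2), "text": " Алло."},
        {"timestamp": (1.2, None), "text": " Добрый день."},
    ],
}

for line in format_chunks(example):
    print(line)
```

With the real pipeline, this would be used as `format_chunks(asr_pipeline(wav, return_timestamps=True, generate_kwargs={"language": "russian"}))`.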

📚 Documentation

Model Details

This is a version of openai/whisper-large-v3 finetuned for better support of the Russian language.

The dataset used for finetuning is the Russian part of Common Voice 17.0, which contains over 200k rows.

After preprocessing the original dataset (all splits were mixed and then divided into a new train/test split at a 0.95/0.05 ratio, giving 225,761 and 11,883 rows respectively), the original Whisper large-v3 has a WER of 9.84 on the test split, while the finetuned version reaches 6.39 (so far).
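
The row counts can be checked with a little arithmetic. The sketch below is an assumption about how such a split might be done, not the author's preprocessing code: it shuffles all rows together and rounds the test share up. That rounding rule happens to reproduce the reported 225,761/11,883 split from the 237,644 total rows. The function names are illustrative.

```python
import math
import random


def split_sizes(total_rows: int, test_fraction: float) -> tuple[int, int]:
    """Return (train_rows, test_rows), rounding the test share up."""
    test_rows = math.ceil(total_rows * test_fraction)
    return total_rows - test_rows, test_rows


def shuffled_split(rows, test_fraction: float, seed: int = 0):
    """Mix all rows together, then cut off a new train/test split."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed keeps the split reproducible
    train_rows, _ = split_sizes(len(rows), test_fraction)
    return rows[:train_rows], rows[train_rows:]


total = 225_761 + 11_883  # row counts reported in the model card
train_rows, test_rows = split_sizes(total, 0.05)
print(total, train_rows, test_rows)
```

Because the test rows come from the same mixed pool as the training rows, the reported WER describes in-domain Common Voice performance.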

The finetuning process took over 60 hours on two NVIDIA A100 80 GB GPUs.

🔧 Technical Details

Model Type: Finetuned version of openai/whisper-large-v3
Training Data: Russian part of Common Voice 17.0, containing over 200k rows
Training Time: Over 60 hours on two NVIDIA A100 80 GB GPUs
Metrics: Word Error Rate (WER)

Work in progress

This model is a work in progress. The goal is to finetune it for speech recognition of phone calls as much as possible. If you want to contribute and know of or have a good dataset, please let me know. Your help will be much appreciated.
