# whisper-large-v3-russian-ties-podlodka-v1.2 Open-Source Model - Free Optimization for Russian Speech Recognition in Phone Recordings

Whisper Large V3 Russian Ties Podlodka V1.2

Developed by Apel-sin

A Russian speech recognition model built with the TIES merge method. It combines two Russian fine-tunes of Whisper-large-v3 and targets telephone call recordings.

Speech Recognition

Transformers

Other #Russian Telephone Call Transcription #TIES Fusion Model #Low-Resource Optimization

Downloads 2,408

Release Time : 4/2/2025

Model Overview

This model merges two Russian Whisper models with the TIES method to improve Russian speech recognition accuracy, with a particular focus on call recordings.

Model Features

TIES Fusion Technology

Uses the TIES merge method with a density of 0.9 and different per-model weights for the encoder (0.8/0.2) and the decoder (0.2/0.8).

Call Scenario Optimization

Targets telephone call audio. Running an audio preprocessing pipeline before recognition is recommended.

Multi-Dataset Training

Incorporates multiple Russian speech datasets including Common Voice 17.0, Taiga Speech, and Podlodka

Model Capabilities

Russian speech-to-text

Long audio chunk processing

Timestamp generation

Low-resource device support

Use Cases

Speech Transcription

Call Recording Transcription

Convert Russian telephone conversations into text transcripts

Optimized recognition accuracy for call audio

Meeting Minutes Generation

Automatically generate text transcripts from Russian meeting recordings

Supports long audio chunk processing

🚀 Whisper Russian Model

This is a merged Whisper model for Russian speech recognition, offering high-quality ASR capabilities.

🚀 Quick Start

This README provides details about a merged Whisper model for Russian speech recognition, including its base models, training datasets, merge method, and usage examples.

✨ Features

Multi-base model merge: Merged from antony66/whisper-large-v3-russian and bond005/whisper-large-v3-ru-podlodka.
Diverse training data: Trained on datasets like mozilla-foundation/common_voice_17_0, bond005/taiga_speech_v2, etc.
TIES merge method: Utilizes the TIES method for model merging.

📦 Installation

The model card gives no dedicated installation steps. At minimum you need the transformers library, and the usage example in this README also imports torch. Both can be installed with:

pip install transformers torch

📚 Documentation

Model Information

Property	Details
Base Models	`antony66/whisper-large-v3-russian`, `bond005/whisper-large-v3-ru-podlodka`
Language	Russian
Library Name	`transformers`
Tags	`asr`, `whisper`, `russian`, `mergekit`, `merge`
Datasets	`mozilla-foundation/common_voice_17_0`, `bond005/taiga_speech_v2`, `bond005/podlodka_speech`, `bond005/rulibrispeech`
Metrics	`wer`
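The only metric listed is `wer` (word error rate): the word-level edit distance between the reference and the recognized text, divided by the number of reference words. The card does not say which tool was used to compute it, so the following is only a minimal pure-Python sketch of the standard definition. The function name `wer` and the whitespace tokenization are assumptions made for illustration.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(wer("привет как дела", "привет дела"))
```

Real evaluations usually normalize case and punctuation before scoring. That step is left out here.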

Model Details

This model was merged using the TIES merge method.

method: ties
parameters:
  ties_density: 0.9
  encoder_weights:
    - 0.8
    - 0.2
  decoder_weights:
    - 0.2
    - 0.8
models:
  model_a: "/mnt/cloud/llm/whisper/whisper-large-v3-russian"
  model_b: "/mnt/cloud/llm/whisper/whisper-large-v3-ru-podlodka"
output_dir: "/mnt/cloud/llm/whisper/whisper-large-v3-russian-ties-podlodka"
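To show what the parameters above control, here is a minimal sketch of the general TIES procedure on flat lists of numbers: trim each model's delta from a base to its largest-magnitude entries, elect a sign per parameter, then average only the deltas that agree with that sign. This is not the author's merge script. The function names, the weighted averaging rule, the use of a separate base checkpoint (the config above names none), and the name-based choice between encoder and decoder weights are all assumptions made for illustration.

```python
def trim(delta, density):
    """Keep the `density` fraction of entries with the largest magnitude."""
    k = int(round(density * len(delta)))
    order = sorted(range(len(delta)), key=lambda i: abs(delta[i]), reverse=True)
    keep = set(order[:k])
    return [d if i in keep else 0.0 for i, d in enumerate(delta)]


def ties_merge(base, models, weights, density):
    """Merge flat parameter lists `models` onto `base` (hypothetical helper)."""
    deltas = [
        trim([m - b for m, b in zip(model, base)], density) for model in models
    ]
    merged = []
    for j, b in enumerate(base):
        weighted = [w * d[j] for w, d in zip(weights, deltas)]
        # Elect the sign that carries the larger weighted mass.
        sign = 1.0 if sum(weighted) >= 0 else -1.0
        agree = [(w, v) for w, v in zip(weights, weighted) if v * sign > 0]
        if agree:
            # Weighted mean over the models that agree with the elected sign.
            merged.append(b + sum(v for _, v in agree) / sum(w for w, _ in agree))
        else:
            merged.append(b)
    return merged


def weights_for(param_name, encoder_weights, decoder_weights):
    """Assumption: encoder parameters have "encoder." in their name."""
    return encoder_weights if "encoder." in param_name else decoder_weights


if __name__ == "__main__":
    base = [0.0, 0.0, 0.0, 0.0]
    model_a = [1.0, -2.0, 0.1, 3.0]
    model_b = [4.0, 2.0, 0.2, -1.0]
    w = weights_for("model.encoder.layers.0.fc1.weight", [0.8, 0.2], [0.2, 0.8])
    print(ties_merge(base, [model_a, model_b], w, density=0.5))
```

With `ties_density: 0.9`, only the smallest 10% of each delta would be dropped, so most of the merge behaviour comes from the sign election and the encoder/decoder weights.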

Simple API server

The model can be served through a simple OpenAI-compatible API server: https://github.com/kreolsky/whisper-api-server/
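Because the server is described as OpenAI compatible, a client would typically post the audio as multipart form data to a `/v1/audio/transcriptions` route and read the `text` field of the JSON reply. The sketch below uses only the standard library. The base URL, the `model` value, the `language` field, and both function names are placeholders and assumptions, since the source does not state what the linked server expects. Check that project's documentation before relying on any of them.

```python
import json
import urllib.request
import uuid


def encode_multipart(fields, file_field, filename, file_bytes,
                     content_type="audio/wav"):
    """Build a multipart/form-data body; returns (body, Content-Type header)."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f'{value}\r\n'.encode()
        )
    parts.append(
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\n'
        f'Content-Type: {content_type}\r\n\r\n'.encode()
        + file_bytes + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe(base_url, wav_path, model="whisper-1", language="ru"):
    """Post a WAV file to an OpenAI-style transcription route (assumed path)."""
    with open(wav_path, "rb") as f:
        body, ctype = encode_multipart(
            {"model": model, "language": language}, "file", "audio.wav", f.read()
        )
    request = urllib.request.Request(
        f"{base_url}/v1/audio/transcriptions",
        data=body,
        headers={"Content-Type": ctype},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["text"]
```

A call would look like `transcribe("http://localhost:8000", "record-normalized.wav")`, where the host and port are again placeholders.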

💻 Usage Examples

Basic Usage

To process phone calls, it is highly recommended that you preprocess your recordings and adjust the volume before performing ASR. For example:

sox record.wav -r 8000 record-normalized.wav norm -0.5 compand 0.3,1 -90,-90,-70,-50,-40,-15,0,0 -7 0 0.15
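When many recordings need the same treatment, the command above can be driven from Python. This is a small sketch that assumes `sox` is installed and on the PATH. The helper names `build_sox_command` and `normalize` are made up for illustration, and the arguments are copied unchanged from the command above.

```python
import subprocess


def build_sox_command(src, dst):
    """Return the sox invocation shown above as an argument list."""
    return [
        "sox", src, "-r", "8000", dst,
        "norm", "-0.5",
        "compand", "0.3,1", "-90,-90,-70,-50,-40,-15,0,0", "-7", "0", "0.15",
    ]


def normalize(src, dst):
    """Run sox and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_sox_command(src, dst), check=True)


if __name__ == "__main__":
    print(" ".join(build_sox_command("record.wav", "record-normalized.wav")))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.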

Then your ASR code should look somewhat like this:

import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor, pipeline

torch_dtype = torch.bfloat16 # set your preferred type here 

device = 'cpu'
if torch.cuda.is_available():
    device = 'cuda'
elif torch.backends.mps.is_available():
    device = 'mps'
    setattr(torch.distributed, "is_initialized", lambda : False) # monkey patching
device = torch.device(device)

# NOTE: as in the source, this id points at one of the two base models.
# To run the merged model, pass its repository id (or a local path) instead.
whisper = WhisperForConditionalGeneration.from_pretrained(
    "antony66/whisper-large-v3-russian", torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True,
    # add attn_implementation="flash_attention_2" if your GPU supports it
)

processor = WhisperProcessor.from_pretrained("antony66/whisper-large-v3-russian")

asr_pipeline = pipeline(
    "automatic-speech-recognition",
    model=whisper,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=256,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,
    torch_dtype=torch_dtype,
    device=device,
)

# read your wav file into the variable wav as raw bytes
# (the pipeline accepts a file path, raw bytes, or a numpy array; decoding bytes requires ffmpeg)
with open('record-normalized.wav', 'rb') as f:
    wav = f.read()

# get the transcription
asr = asr_pipeline(wav, generate_kwargs={"language": "russian", "max_new_tokens": 256}, return_timestamps=False)

print(asr['text'])
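The pipeline above is created with `return_timestamps=True`, but the call overrides it with `return_timestamps=False`, so only `text` is returned. If timestamps are left enabled, the result also carries a `chunks` list whose items have a `timestamp` pair and a `text` field. The helper below is a sketch for printing those chunks. The name `format_chunks` and the output layout are illustrative choices, and the sample dictionary is hand-written rather than real model output.

```python
def format_chunks(result):
    """Turn a pipeline result with a "chunks" list into "[start - end] text" lines."""
    lines = []
    for chunk in result.get("chunks", []):
        start, end = chunk["timestamp"]
        # The end of the final chunk can be None if no closing timestamp was predicted.
        end_label = f"{end:.2f}" if end is not None else "?"
        lines.append(f"[{start:.2f} - {end_label}] {chunk['text'].strip()}")
    return lines


if __name__ == "__main__":
    # Hand-written sample in the shape the pipeline returns with timestamps enabled.
    sample = {
        "text": " Алло. Да.",
        "chunks": [
            {"timestamp": (0.0, 1.5), "text": " Алло."},
            {"timestamp": (1.5, None), "text": " Да."},
        ],
    }
    print("\n".join(format_chunks(sample)))
```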

🔧 Technical Details

This model is still a work in progress. The goal is to fine-tune it for speech recognition of phone calls as far as possible. If you want to contribute, or you know of or have a good dataset, please let me know. Your help will be much appreciated.
