Open-source Russian speech recognition model whisper-large-v3-russian-ties-podlodka-v1.0, optimized for telephone speech recognition

Whisper Large V3 Russian Ties Podlodka V1.0

Developed by Apel-sin

Russian speech recognition model fused using the TIES merging method, optimized for telephone speech recognition

Speech Recognition

Transformers

Other #Russian Speech Recognition #Telephone Recording Optimization #TIES Model Fusion

Downloads 96

Release Time: 3/4/2025

Model Overview

This model was created by merging two Russian Whisper models with the TIES merging method. It targets Russian automatic speech recognition, with particular optimization for telephone recordings.

Model Features

TIES Merging Method

Merges two Russian Whisper models with the TIES method, using a density of 0.85 and separate merge weights for the encoder (0.65 / 0.35) and the decoder (0.6 / 0.4)

Telephone Speech Optimization

Optimized for telephone recordings; normalizing the volume of each recording before recognition is recommended

Multi-Dataset Training

Built from models trained on multiple Russian speech datasets: Common Voice 17.0, Taiga Speech v2, Podlodka Speech, and RuLibriSpeech

Model Capabilities

Russian Speech Recognition

Telephone Recording Transcription

Long Audio Processing

Use Cases

Speech Transcription

Telephone Recording Transcription

Convert Russian telephone recordings into text

Optimized telephone speech recognition performance

Meeting Minutes

Convert Russian meeting recordings into text records

🚀 Whisper Large V3 Russian Ties Podlodka Model

This is a merged Whisper model for Russian Automatic Speech Recognition (ASR), tuned toward telephone speech.

🚀 Quick Start

A newer version of the model is available at Apel-sin/whisper-large-v3-russian-ties-podlodka-v1.2.

✨ Features

Multi-base model merge: Merged from antony66/whisper-large-v3-russian and bond005/whisper-large-v3-ru-podlodka.
TIES merge method: The two models are combined with the TIES merge method.
OpenAI compatible API support: Can be served through a simple OpenAI compatible API server.

💻 Usage Examples

Basic Usage

For phone call processing, it is highly recommended to preprocess your recordings and normalize the volume before running ASR. For example:

sox record.wav -r 8000 record-normalized.wav norm -0.5 compand 0.3,1 -90,-90,-70,-50,-40,-15,0,0 -7 0 0.15
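
If the preprocessing has to run from Python rather than from a shell, the same sox call can be wrapped in a small helper. The following is a minimal sketch, assuming sox is installed and on the PATH; the function names build_sox_command and normalize_recording are illustrative and not part of the model card.

```python
import shutil
import subprocess


def build_sox_command(src, dst, rate=8000):
    """Return the sox argument list from the example above as a Python list."""
    return [
        "sox", src, "-r", str(rate), dst,
        "norm", "-0.5",
        "compand", "0.3,1", "-90,-90,-70,-50,-40,-15,0,0", "-7", "0", "0.15",
    ]


def normalize_recording(src, dst, rate=8000):
    """Run sox on one file; raises CalledProcessError if sox exits with an error."""
    if shutil.which("sox") is None:
        raise RuntimeError("sox executable not found on PATH")
    # Passing a list (no shell=True) avoids quoting problems with file names.
    subprocess.run(build_sox_command(src, dst, rate), check=True)
    return dst
```

Calling normalize_recording("record.wav", "record-normalized.wav") would produce the file that the ASR code reads.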

Then, your ASR code should look somewhat like this:

import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor, pipeline

torch_dtype = torch.bfloat16 # set your preferred type here 

device = 'cpu'
if torch.cuda.is_available():
    device = 'cuda'
elif torch.backends.mps.is_available():
    device = 'mps'
    setattr(torch.distributed, "is_initialized", lambda : False) # monkey patching
device = torch.device(device)

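# Note: this example loads one of the two base models.
# Substitute the repository id of the merged model to use it instead.
whisper = WhisperForConditionalGeneration.from_pretrained(
    "antony66/whisper-large-v3-russian", torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True,
    # add attn_implementation="flash_attention_2" if your GPU supports it
)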

processor = WhisperProcessor.from_pretrained("antony66/whisper-large-v3-russian")

asr_pipeline = pipeline(
    "automatic-speech-recognition",
    model=whisper,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=256,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,
    torch_dtype=torch_dtype,
    device=device,
)

# read your wav file into the variable wav as raw bytes
# (the pipeline decodes bytes with ffmpeg, so ffmpeg must be installed). For example:
with open('record-normalized.wav', 'rb') as f:
    wav = f.read()

# get the transcription
asr = asr_pipeline(wav, generate_kwargs={"language": "russian", "max_new_tokens": 256}, return_timestamps=False)

print(asr['text'])

📚 Documentation

Model Details

This model was merged using the TIES merge method.

method: ties
parameters:
  ties_density: 0.85
  encoder_weights:
    - 0.65
    - 0.35
  decoder_weights:
    - 0.6
    - 0.4
models:
  model_a: "/mnt/cloud/llm/whisper/whisper-large-v3-russian"
  model_b: "/mnt/cloud/llm/whisper/whisper-large-v3-ru-podlodka"
output_dir: "/mnt/cloud/llm/whisper/whisper-large-v3-russian-ties-podlodka"
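
To make the parameters above more concrete, here is a toy sketch of the general TIES procedure (trim, elect sign, merge) on flat lists of numbers. It is not the script that produced this model: the common base checkpoint, the weighted averaging, and the names trim and ties_merge are all assumptions made for illustration. In a real merge it would be applied per tensor, with the encoder and decoder weights from the config used for their respective parameters.

```python
def trim(task_vector, density):
    """Keep the `density` fraction of entries with the largest magnitude; zero the rest."""
    k = int(round(density * len(task_vector)))
    order = sorted(range(len(task_vector)),
                   key=lambda i: abs(task_vector[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(task_vector)]


def ties_merge(base, models, weights, density):
    """Merge fine-tuned parameter lists `models` relative to a shared `base` (assumed)."""
    # 1. Task vectors: what each fine-tuned model changed relative to the base.
    deltas = [[m - b for m, b in zip(model, base)] for model in models]
    # 2. Trim: drop the smallest changes of each model.
    trimmed = [trim(d, density) for d in deltas]
    weighted = [[w * v for v in t] for w, t in zip(weights, trimmed)]

    merged = []
    for j, b in enumerate(base):
        column = [wt[j] for wt in weighted]
        # 3. Elect sign: the direction with the larger total weighted mass wins.
        sign = 1.0 if sum(column) >= 0 else -1.0
        # 4. Merge: weighted average over the models that agree with that sign.
        agree = [(w, v) for w, v in zip(weights, column) if v * sign > 0]
        delta = sum(v for _, v in agree) / sum(w for w, _ in agree) if agree else 0.0
        merged.append(b + delta)
    return merged
```

With density 0.85, each model keeps its 85% largest changes; a parameter where the two models disagree in sign takes the value from the side with the larger weighted change rather than an average that would cancel out.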

Simple API server

It can be used with a simple OpenAI compatible API server: https://github.com/kreolsky/whisper-api-server/
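
As a rough illustration of what "OpenAI compatible" means for a client, the sketch below builds a multipart request for the standard OpenAI transcription route, /v1/audio/transcriptions, using only the Python standard library. The host, port, and model name are hypothetical placeholders; check the server's own README for the values and routes it actually expects.

```python
import urllib.request
import uuid


def build_transcription_request(base_url, wav_bytes, model, filename="audio.wav"):
    """Build (but do not send) a POST request in the OpenAI transcription format."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        b'Content-Disposition: form-data; name="model"\r\n\r\n',
        model.encode() + b"\r\n",
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
        b"Content-Type: audio/wav\r\n\r\n",
        wav_bytes + b"\r\n",
        f"--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/audio/transcriptions",
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
```

The request can then be sent with urllib.request.urlopen(req); an OpenAI-style response is a JSON object whose "text" field holds the transcription.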

Work in progress

This model is still a work in progress. The goal is to fine-tune it as far as possible for speech recognition of phone calls. If you would like to contribute, or you know of or have a good dataset, please let me know. Your help will be much appreciated.

🔧 Technical Details

Property	Details
Base Model	antony66/whisper-large-v3-russian, bond005/whisper-large-v3-ru-podlodka
Language	ru
Library Name	transformers
Tags	asr, whisper, russian, mergekit, merge
Datasets	mozilla-foundation/common_voice_17_0, bond005/taiga_speech_v2, bond005/podlodka_speech, bond005/rulibrispeech
Metrics	wer
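
The metric listed above, WER (word error rate), is the word-level edit distance between a reference transcript and the recognized text, divided by the number of reference words. The sketch below is a plain-Python illustration; it assumes whitespace tokenization and applies no text normalization (case, punctuation), which real evaluations usually add. The function name word_error_rate is illustrative.

```python
def word_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of an extra word
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = cur
    return prev[-1] / len(ref)
```

Comparing the score on raw and sox-normalized versions of the same calls is one way to check whether the preprocessing helps on your data.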
