ru_whisper_small: Open-Source Russian Speech Recognition Model

Ru Whisper Small

Developed by Val123val

Russian speech recognition model fine-tuned from openai/whisper-small on the Sberdevices_golos_10h_crowd dataset.

Speech Recognition

Transformers

Open Source License: Apache-2.0

Tags: #Russian speech recognition #Small model fine-tuning #Long audio chunk processing

Downloads 43

Release Time: 12/28/2023

Model Overview

Speech recognition model optimized for Russian, suitable for automatic speech transcription tasks

Model Features

Russian optimization

Specifically fine-tuned for Russian speech data to improve recognition accuracy

Long audio processing

Supports processing audio longer than 30 seconds through chunking algorithms

Timestamp prediction

Can return timestamp information for speech recognition results

Speculative decoding support

Can use auxiliary models to accelerate the inference process

Model Capabilities

Russian speech recognition

Long audio transcription

Timestamp prediction

Use Cases

Speech transcription

Russian meeting minutes

Automatically transcribe Russian meeting content

Russian media content subtitle generation

Automatically generate subtitles for Russian videos

🚀 ru_whisper_small - Val123val

This model is a fine-tuned version of openai/whisper-small, trained on the Sberdevices_golos_10h_crowd dataset. It is intended as an automatic speech recognition (ASR) model for Russian.

✨ Features

Transformer-based: Whisper is a Transformer-based encoder-decoder (sequence-to-sequence) model, trained with large-scale weak supervision on 680k hours of labelled speech data. Russian accounts for only 5k hours of that total.
Fine-tuned: ru_whisper_small is a fine-tuned version of openai/whisper-small on the Sberdevices_golos_10h_crowd dataset, aimed at improving recognition accuracy on Russian speech.
Long-Form Transcription: Can transcribe audio of arbitrary length using a chunking algorithm.
Speculative Decoding: Supports speculative decoding, in which a smaller assistant model speeds up inference.
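
To make the chunking idea concrete, here is a minimal sketch of how a long recording can be split into overlapping windows. It is an illustration only, not the Transformers pipeline's implementation: the helper name `chunk_bounds` and the 5-second stride are assumptions, and the real pipeline works on audio samples and also merges the overlapping transcriptions.

```python
def chunk_bounds(total_s, chunk_length_s=30.0, stride_s=5.0):
    """Return (start, end) windows in seconds that cover [0, total_s].

    Neighbouring windows overlap by 2 * stride_s so that words cut at a
    window edge also appear whole in the adjacent window.
    Hypothetical helper for illustration only.
    """
    if chunk_length_s <= 2 * stride_s:
        raise ValueError("chunk_length_s must be larger than twice stride_s")
    step = chunk_length_s - 2 * stride_s  # how far each window advances
    bounds = []
    start = 0.0
    while True:
        end = min(start + chunk_length_s, total_s)
        bounds.append((start, end))
        if end >= total_s:
            break
        start += step
    return bounds


if __name__ == "__main__":
    # A 70-second recording yields three overlapping 30-second windows.
    for start, end in chunk_bounds(70.0):
        print(f"{start:5.1f}s -> {end:5.1f}s")
```

Each window is short enough for Whisper's 30-second input, and the overlap gives the merging step context on both sides of every cut.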

📦 Installation

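The original model card lists no installation steps. The usage examples below import the transformers, datasets, and torch packages; the versions used during training are listed under Framework versions.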

💻 Usage Examples

Basic Usage

from transformers import WhisperProcessor, WhisperForConditionalGeneration
from datasets import load_dataset

# load model and processor
processor = WhisperProcessor.from_pretrained("Val123val/ru_whisper_small")
model = WhisperForConditionalGeneration.from_pretrained("Val123val/ru_whisper_small")
model.config.forced_decoder_ids = None

# load dataset and read audio files
# (token=True uses the Hugging Face token saved by a prior login)
ds = load_dataset("bond005/sberdevices_golos_10h_crowd", split="validation", token=True)
sample = ds[0]["audio"]
input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features

# generate token ids
predicted_ids = model.generate(input_features)

# decode token ids to text, keeping special tokens such as the language and task markers
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=False)

# decode again without special tokens to get the plain transcription
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

Advanced Usage - Long-Form Transcription

import torch
from transformers import pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"

pipe = pipeline(
  "automatic-speech-recognition",
  model="Val123val/ru_whisper_small",
  chunk_length_s=30,
  device=device,
)

ds = load_dataset("bond005/sberdevices_golos_10h_crowd", split="validation", token=True)
sample = ds[0]["audio"]

prediction = pipe(sample.copy(), batch_size=8)["text"]

# we can also return timestamps for the predictions
prediction = pipe(sample.copy(), batch_size=8, return_timestamps=True)["chunks"]
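
The timestamped chunks returned above map naturally onto the subtitle use case. The sketch below converts such a list into SRT text. It assumes each chunk is a dict with a `"timestamp"` pair of seconds and a `"text"` string, as returned under `"chunks"`; the helper names `to_srt_time` and `chunks_to_srt` are hypothetical, and the handling of a missing end time is a simplification.

```python
def to_srt_time(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms_total = int(round(seconds * 1000))
    hours, rem = divmod(ms_total, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def chunks_to_srt(chunks):
    """Turn [{'timestamp': (start, end), 'text': ...}, ...] into SRT text."""
    blocks = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        if end is None:
            # Assumption: the final chunk may lack an end time; reuse the start.
            end = start
        text = chunk["text"].strip()
        blocks.append(f"{index}\n{to_srt_time(start)} --> {to_srt_time(end)}\n{text}")
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    example = [
        {"timestamp": (0.0, 2.5), "text": " привет"},
        {"timestamp": (2.5, None), "text": " мир"},
    ]
    print(chunks_to_srt(example))
```

With the pipeline example, the call would be `chunks_to_srt(prediction)`, written to a `.srt` file with UTF-8 encoding.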

Advanced Usage - Faster Inference with Speculative Decoding

import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# load dataset
dataset = load_dataset("bond005/sberdevices_golos_10h_crowd", split="validation", token=True)

# load model
model_id = "Val123val/ru_whisper_small"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    attn_implementation="sdpa",
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

# load assistant model
assistant_model_id = "openai/whisper-tiny"

assistant_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    assistant_model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    attn_implementation="sdpa",
)

assistant_model.to(device)

# make pipe
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=15,
    batch_size=1,  # assisted generation in Transformers works on one sequence at a time
    generate_kwargs={"assistant_model": assistant_model},
    torch_dtype=torch_dtype,
    device=device,
)

sample = dataset[0]["audio"]
result = pipe(sample)
print(result["text"])
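
The principle behind the assistant model can be shown with a toy sketch. It is not the Transformers implementation, which runs inside `generate` when `assistant_model` is passed; the two "models" here are plain functions that map a token list to the next token, and every name is hypothetical. The draft proposes a few tokens, the target checks them, and only the agreeing prefix plus the target's own correction is kept, so the output matches ordinary greedy decoding.

```python
def greedy_decode(target_next, prompt, max_new_tokens):
    """Baseline: ask the target model for one token at a time."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target_next(tokens))
    return tokens


def speculative_decode(target_next, draft_next, prompt, max_new_tokens, lookahead=4):
    """Greedy speculative decoding with toy next-token functions."""
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    while len(tokens) < limit:
        # 1. The cheap draft model proposes up to `lookahead` tokens.
        draft = []
        for _ in range(min(lookahead, limit - len(tokens))):
            draft.append(draft_next(tokens + draft))
        # 2. The target model checks each proposal. A real model scores all
        #    draft positions in a single forward pass, which is where the
        #    speed-up comes from; this toy loop checks them one by one.
        for proposed in draft:
            expected = target_next(tokens)
            tokens.append(expected)
            if expected != proposed:
                break  # keep the target's token, drop the rest of the draft
    return tokens


if __name__ == "__main__":
    # Toy models: the draft agrees with the target except at some positions.
    target = lambda toks: (toks[-1] * 3 + 1) % 7
    draft = lambda toks: (toks[-1] * 3 + 1) % 7 if len(toks) % 3 else 0
    assert speculative_decode(target, draft, [1], 20) == greedy_decode(target, [1], 20)
    print(speculative_decode(target, draft, [1], 20))
```

Because every kept token is the target's own choice, a weak assistant costs speed but not accuracy.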

📚 Documentation

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 0.0001
train_batch_size: 32
eval_batch_size: 16
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 500
training_steps: 5000
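
To show how the learning-rate settings above interact, here is a small sketch of a linear schedule with warmup. It assumes that the `linear` scheduler type behaves like the usual linear-warmup, linear-decay schedule in Transformers; the function name `linear_schedule_lr` is hypothetical, and the defaults are the values listed above.

```python
def linear_schedule_lr(step, base_lr=1e-4, warmup_steps=500, total_steps=5000):
    """Learning rate at a given optimizer step.

    Rises linearly from 0 to base_lr during warmup, then decays
    linearly back to 0 at total_steps.
    """
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


if __name__ == "__main__":
    for step in (0, 250, 500, 2750, 5000):
        print(f"step {step:4d}: lr = {linear_schedule_lr(step):.2e}")
```

Under these assumptions the rate peaks at 0.0001 at step 500 and reaches zero at step 5000.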

Framework versions

Transformers 4.36.2
Pytorch 2.1.0+cu121
Datasets 2.16.0
Tokenizers 0.15.0

📄 License

This project is licensed under the Apache-2.0 license.
