🚀 ru_whisper_small - Val123val
This model is a fine - tuned version of openai/whisper-small on the Sberdevices_golos_10h_crowd dataset, which is potentially quite useful as an ASR solution, especially for Russian speech recognition.
✨ Features
- Transformer - based: Whisper is a Transformer based encoder - decoder (sequence - to - sequence) model, trained on 680k hours of labelled speech data with large - scale weak supervision. The Russian language data accounts for only 5k hours of the total.
- Fine - tuned: ru_whisper_small is a fine - tuned version of openai/whisper-small on the Sberdevices_golos_10h_crowd dataset, offering enhanced performance for specific tasks.
- Long - Form Transcription: Can transcribe audio of arbitrary length using a chunking algorithm.
- Speculative Decoding: Supports speculative decoding for faster inference.
📦 Installation
No specific installation steps are provided in the original document.
💻 Usage Examples
Basic Usage
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from datasets import load_dataset
processor = WhisperProcessor.from_pretrained("Val123val/ru_whisper_small")
model = WhisperForConditionalGeneration.from_pretrained("Val123val/ru_whisper_small")
model.config.forced_decoder_ids = None
ds = load_dataset("bond005/sberdevices_golos_10h_crowd", split="validation", token=True)
sample = ds[0]["audio"]
input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=False)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
Advanced Usage - Long - Form Transcription
import torch
from transformers import pipeline
from datasets import load_dataset
device = "cuda:0" if torch.cuda.is_available() else "cpu"
pipe = pipeline(
"automatic-speech-recognition",
model="Val123val/ru_whisper_small",
chunk_length_s=30,
device=device,
)
ds = load_dataset("bond005/sberdevices_golos_10h_crowd", split="validation", token=True)
sample = ds[0]["audio"]
prediction = pipe(sample.copy(), batch_size=8)["text"]
prediction = pipe(sample.copy(), batch_size=8, return_timestamps=True)["chunks"]
Advanced Usage - Faster using with Speculative Decoding
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
from transformers import pipeline
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
dataset = load_dataset("bond005/sberdevices_golos_10h_crowd", split="validation", token=True)
model_id = "Val123val/ru_whisper_small"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
model_id,
torch_dtype=torch_dtype,
low_cpu_mem_usage=True,
use_safetensors=True,
attn_implementation="sdpa",
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)
assistant_model_id = "openai/whisper-tiny"
assistant_model = AutoModelForSpeechSeq2Seq.from_pretrained(
assistant_model_id,
torch_dtype=torch_dtype,
low_cpu_mem_usage=True,
use_safetensors=True,
attn_implementation="sdpa",
)
assistant_model.to(device);
pipe = pipeline(
"automatic-speech-recognition",
model=model,
tokenizer=processor.tokenizer,
feature_extractor=processor.feature_extractor,
max_new_tokens=128,
chunk_length_s=15,
batch_size=4,
generate_kwargs={"assistant_model": assistant_model},
torch_dtype=torch_dtype,
device=device,
)
sample = dataset[0]["audio"]
result = pipe(sample)
print(result["text"])
📚 Documentation
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 0.0001
- train_batch_size: 32
- eval_batch_size: 16
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e - 08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- training_steps: 5000
Framework versions
- Transformers 4.36.2
- Pytorch 2.1.0+cu121
- Datasets 2.16.0
- Tokenizers 0.15.0
📄 License
This project is licensed under the Apache - 2.0 license.