Whisper
A Whisper-base model for automatic speech recognition (ASR), tailored to Kazakh, a low-resource language. It transcribes Kazakh speech and was fine-tuned on a large labeled corpus.
Quick Start
This model is a fine-tuned Whisper-base for automatic speech recognition of Kazakh. It was trained on the Kazakh Speech Corpus 2 (KSC2), which contains more than 1k hours of labeled data, and achieves a word error rate (WER) of 15.36% on the test set.
Features
- Language-specific: a Kazakh-only model, dedicated to high-quality Kazakh speech recognition.
- Long-form transcription: with a chunking algorithm, it can transcribe audio of arbitrary length.
Usage Examples
Basic Usage
>>> from transformers import WhisperProcessor, WhisperForConditionalGeneration
>>> import librosa
>>>
>>> # Load the processor (feature extractor and tokenizer) and the fine-tuned model
>>> processor = WhisperProcessor.from_pretrained("akuzdeuov/whisper-base.kk")
>>> model = WhisperForConditionalGeneration.from_pretrained("akuzdeuov/whisper-base.kk")
>>>
>>> # Load the audio file, resampled to 16 kHz, and convert it to input features
>>> audio, sampling_rate = librosa.load("path_to_audio", sr=16000)
>>> input_features = processor(audio, sampling_rate=sampling_rate, return_tensors="pt").input_features
>>>
>>> # Generate token IDs
>>> predicted_ids = model.generate(input_features)
>>>
>>> # Decode with the special (context) tokens kept
>>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=False)
>>> # Decode with the special (context) tokens removed
>>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
Setting skip_special_tokens=True removes the context tokens from the start of the transcription.
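To make the effect of that flag concrete without downloading the model, here is a rough illustration that strips Whisper-style `<|...|>` markers from a decoded string with a regular expression. It is only a sketch: the tokenizer itself filters by special-token ID rather than by pattern, the helper name `strip_special_tokens` is hypothetical, and the `raw` string is a made-up example of decoder output, not a real transcription from this model.

```python
import re

# Whisper-style special tokens look like <|startoftranscript|> or <|endoftext|>.
SPECIAL_TOKEN = re.compile(r"<\|[^|]*\|>")


def strip_special_tokens(text: str) -> str:
    """Remove <|...|> markers and surrounding whitespace (illustration only)."""
    return SPECIAL_TOKEN.sub("", text).strip()


# Hypothetical raw output, as it might look with skip_special_tokens=False.
raw = "<|startoftranscript|><|kk|><|transcribe|><|notimestamps|> сәлем әлем<|endoftext|>"
print(strip_special_tokens(raw))
```

In practice, passing skip_special_tokens=True to batch_decode is the supported way to obtain the clean text.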
Advanced Usage
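>>> import torch
>>> from transformers import pipeline
>>>
>>> # Use the first GPU when available, otherwise fall back to the CPU
>>> device = "cuda:0" if torch.cuda.is_available() else "cpu"
>>>
>>> pipe = pipeline(
...     "automatic-speech-recognition",
...     model="akuzdeuov/whisper-base.kk",
...     chunk_length_s=30,  # enable chunked long-form transcription
...     device=device,
... )
>>>
>>> # batch_size sets how many chunks are processed together
>>> prediction = pipe("path_to_audio", batch_size=8)["text"]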
Whisper is designed to work on audio samples of up to 30 s in duration. With a chunking algorithm, however, it can transcribe audio of arbitrary length. The Transformers pipeline method supports this: chunking is enabled by setting chunk_length_s=30 when instantiating the pipeline, and with chunking enabled the pipeline can run batched inference.
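The idea behind chunking can be sketched in a few lines. The helper below, `chunk_audio`, is hypothetical and is not the pipeline's implementation: it only shows how a long waveform might be cut into overlapping 30 s windows. The overlap parameter `stride_s` and its default are assumptions made for illustration, and the step of reconciling the text from overlapping regions, which a real pipeline also has to perform, is not shown.

```python
from typing import List, Sequence


def chunk_audio(
    audio: Sequence[float],
    sampling_rate: int = 16000,
    chunk_length_s: float = 30.0,
    stride_s: float = 5.0,  # assumed overlap between neighbouring chunks
) -> List[Sequence[float]]:
    """Split a 1-D waveform into fixed-length windows that overlap by stride_s."""
    chunk_len = int(chunk_length_s * sampling_rate)
    overlap = int(stride_s * sampling_rate)
    step = chunk_len - overlap
    if step <= 0:
        raise ValueError("stride_s must be smaller than chunk_length_s")

    chunks = []
    for start in range(0, len(audio), step):
        chunks.append(audio[start:start + chunk_len])
        # Stop once a window has reached the end of the waveform.
        if start + chunk_len >= len(audio):
            break
    return chunks


# 70 s of silence at 16 kHz, standing in for a real recording.
waveform = [0.0] * (70 * 16000)
chunks = chunk_audio(waveform)
print(len(chunks), [len(c) / 16000 for c in chunks])
```

The function relies only on `len` and slicing, so it should also accept the NumPy array returned by librosa.load. Each window can then be transcribed independently, which is what makes batched inference possible.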
Documentation
| Property | Details |
|----------|---------|
| Model Type | whisper-base.kk |
| Training Data | Kazakh Speech Corpus 2 with over 1k hours of labelled data |
| Task | Automatic Speech Recognition |
| Dataset | Kazakh Speech Corpus 2 (KSC2), type: librispeech_asr, config: clean, split: test, language: kk |
| Metrics | Test WER: 15.36% |
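For readers unfamiliar with the metric, WER is conventionally the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference. The sketch below implements that standard definition in plain Python. The function name `word_error_rate` is hypothetical, and the source does not say which tool or text normalization produced the 15.36% figure, so this should not be read as the evaluation script behind it.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution and one deletion against a four-word reference.
print(word_error_rate("a b c d", "a x c"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.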
License
This project is licensed under the Apache-2.0 license.
References
- Whisper, OpenAI.