🚀 Whisper Turbo KSC2 Model
This is a fine-tuned model based on Whisper large-v3-turbo, designed for automatic speech recognition of Kazakh audio.
🚀 Quick Start
This model is Whisper large-v3-turbo fine-tuned on the Kazakh Speech Corpus 2 (KSC2), which contains about 1,000 hours of transcribed audio from diverse sources. After training on the train partition, it achieves a 9.16% WER on the test partition.
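As a rough sketch of how a WER score like this can be computed (the reference and predicted sentences below are placeholders, not the actual KSC2 evaluation data), the Hugging Face evaluate library's wer metric compares hypothesis transcripts against references:

    import evaluate

    wer_metric = evaluate.load("wer")
    # placeholder texts; in practice these come from the KSC2 test partition
    references = ["бұл мысал сөйлем"]
    predictions = ["бұл мысал сөйлем"]
    wer = wer_metric.compute(predictions=predictions, references=references)
    print(f"WER: {wer:.2%}")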
💻 Usage Examples
Basic Usage
import librosa
import numpy as np
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration


class Transcriber:
    def __init__(
        self,
        model_path="abilmansplus/whisper-turbo-ksc2",
        device="cuda:0",
        sampling_rate=16_000,
        language="kazakh",
        task="transcribe",
        num_beams=5,
        chunk_length_s=30,
        stride_length_s=1,
    ):
        self.processor = WhisperProcessor.from_pretrained(
            model_path,
            language=language,
            task=task
        )
        self.model = WhisperForConditionalGeneration.from_pretrained(model_path)
        self.model = self.model.to(device)
        self.sr = sampling_rate
        self.language = language
        self.task = task
        self.num_beams = num_beams
        self.chunk_length_s = chunk_length_s  # length of each chunk in seconds
        self.stride_length_s = stride_length_s  # overlap between consecutive chunks in seconds

    def transcribe(self, audio_path: str) -> str:
        """Transcribes the audio chunk by chunk and merges the results.

        Args:
            audio_path (str): path to the audio file to be transcribed
        Returns:
            full_transcription (str): transcription of the entire audio
        """
        speech_array, sampling_rate = librosa.load(audio_path, sr=self.sr)
        audio_length_s = len(speech_array) / self.sr
        # Short audio fits into a single Whisper window, so no chunking is needed.
        if audio_length_s <= self.chunk_length_s:
            full_transcription = self._transcribe_chunk(speech_array)
            return full_transcription
        chunk_length_samples = int(self.chunk_length_s * self.sr)
        stride_length_samples = int(self.stride_length_s * self.sr)
        num_samples = len(speech_array)
        # Number of overlapping chunks needed to cover the whole signal.
        num_chunks = max(
            1,
            int(
                1 +
                np.ceil(
                    (num_samples - chunk_length_samples) /
                    (chunk_length_samples - stride_length_samples)
                )
            )
        )
        transcriptions = []
        for i in range(num_chunks):
            start = max(0, i * (chunk_length_samples - stride_length_samples))
            end = min(num_samples, start + chunk_length_samples)
            chunk = speech_array[start:end]
            chunk_transcription = self._transcribe_chunk(chunk)
            transcriptions.append(chunk_transcription)
        full_transcription = " ".join(transcriptions)
        return full_transcription

    def _transcribe_chunk(self, audio_chunk) -> str:
        # Convert raw audio into the log-mel input features expected by Whisper.
        inputs = self.processor(
            audio_chunk,
            sampling_rate=self.sr,
            return_tensors="pt"
        ).input_features.to(self.model.device)
        # Force decoding in the target language and task.
        forced_decoder_ids = self.processor.get_decoder_prompt_ids(
            language=self.language,
            task=self.task
        )
        # All-ones mask over the feature frames (a single input is not padded relative to other batch items).
        attention_mask = torch.ones_like(inputs[:, 0, :], dtype=torch.long)
        with torch.no_grad():
            generated_ids = self.model.generate(
                inputs,
                forced_decoder_ids=forced_decoder_ids,
                max_length=448,
                num_beams=self.num_beams,
                attention_mask=attention_mask,
            )
        transcription = self.processor.batch_decode(
            generated_ids,
            skip_special_tokens=True
        )[0]
        return transcription
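A minimal usage sketch of the class above (audio.wav is a placeholder path; pass device="cpu" if no GPU is available):

    transcriber = Transcriber(model_path="abilmansplus/whisper-turbo-ksc2", device="cuda:0")
    text = transcriber.transcribe("audio.wav")  # placeholder path to a local recording
    print(text)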
Advanced Usage
⚠️ Important Note
For longer audio (35+ seconds), divide it into 30-second chunks, transcribe each chunk separately, and then merge the results; this is what the Transcriber class above does, using a one-second overlap between consecutive chunks. An alternative sketch using the pipeline API is shown below.
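As an alternative sketch, the transformers automatic-speech-recognition pipeline can perform the chunking itself; the chunk_length_s value and the long_audio.wav path below are assumptions chosen to mirror the class above, not settings prescribed by this model card:

    import torch
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model="abilmansplus/whisper-turbo-ksc2",
        device="cuda:0" if torch.cuda.is_available() else "cpu",
        chunk_length_s=30,  # assumed chunk size, matching the class above
    )
    result = asr(
        "long_audio.wav",  # placeholder path to a long recording
        generate_kwargs={"language": "kazakh", "task": "transcribe"},
    )
    print(result["text"])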
📄 License
This project is released under the MIT license.
📦 Model Information
| Property | Details |
|----------|---------|
| Model Type | Fine-tuned Whisper large-v3-turbo |
| Training Data | issai/Kazakh_Speech_Corpus_2 |
| Evaluation Metric | WER = 9.16% |
| Base Model | openai/whisper-large-v3-turbo |
| Pipeline Tag | automatic-speech-recognition |
| Library Name | transformers |