Open-source whisper-turbo-ksc2 automatic speech recognition model - accurate Kazakh speech recognition

Whisper Turbo Ksc2

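Developed by abilmansplus

An automatic speech recognition model based on Whisper large-v3-turbo and fine-tuned on about 1,000 hours of Kazakh speech data, with a test-set word error rate (WER) of 9.16%.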

Speech Recognition

Transformers

License: MIT #Kazakh speech recognition #Low-WER transcription #Chunked long-audio processing

Downloads: 1,740

Released: 5/1/2025

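Model Overview

A speech recognition model optimized specifically for Kazakh that accurately transcribes Kazakh speech.

Model Features

High-accuracy Kazakh recognition

Fine-tuned on about 1,000 hours of Kazakh data, with a test-set word error rate (WER) of only 9.16%.

Long-audio handling

Transcribes audio longer than 30 seconds by processing it in chunks.

Built on Whisper

Fine-tuned from Whisper large-v3-turbo and inherits the strengths of the base model.

Model Capabilities

Kazakh speech recognition

Long-audio transcription

High-quality speech-to-text

Use Cases

Speech transcription

Kazakh meeting notes

Automatically transcribe the content of meetings held in Kazakh.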

Accuracy: 90.84% (which corresponds to 100% minus the 9.16% test-set WER)

Subtitle generation for media content

Automatically generate subtitles for Kazakh video content.

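🚀 Whisper model fine-tuned on the Kazakh Speech Corpus

This project is an automatic speech recognition model fine-tuned from Whisper large-v3-turbo on Kazakh Speech Corpus 2 (about 1,000 hours of transcribed audio from a variety of sources). After training on the training split, the model achieves a word error rate (WER) of 9.16% on the test set.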
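
To make the WER figure concrete, here is a minimal sketch of how a word error rate can be computed: the word-level edit distance between a reference and a hypothesis, divided by the number of reference words. The function name `word_error_rate` is hypothetical, and the source does not say which tool or text normalization was used to obtain 9.16%, so treat this as an illustration of the metric rather than the authors' evaluation script.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (b -> x) and one deletion (d) over four reference words.
    print(word_error_rate("a b c d", "a x c"))
```

Because the denominator is the reference length, WER can exceed 100% when the hypothesis contains many inserted words.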

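🚀 Quick Start

Model Information

Property	Details
Model type	Automatic speech recognition model fine-tuned from Whisper large-v3-turbo
Training data	Kazakh Speech Corpus 2 (issai/Kazakh_Speech_Corpus_2)
Evaluation metric	Word error rate (WER); 9.16% on the test set
Base model	openai/whisper-large-v3-turbo
Library	transformers
License	MIT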

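Recommendations for Long Audio

⚠️ Important

For longer audio (35 seconds or more), split it into 30-second segments, transcribe each segment separately, and then merge the results.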
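
The splitting step can be sketched as follows. This is a minimal illustration that only computes chunk boundaries in samples; the helper name `chunk_bounds` is hypothetical, and the 1-second overlap between neighbouring chunks is an assumption taken from the default `stride_length_s=1` in the usage example on this page, not a requirement stated in the note above.

```python
import math


def chunk_bounds(num_samples, sr=16_000, chunk_length_s=30.0, stride_length_s=1.0):
    """Return (start, end) sample indices for overlapping fixed-length chunks."""
    chunk = int(chunk_length_s * sr)
    step = chunk - int(stride_length_s * sr)  # distance between chunk starts
    if step <= 0:
        raise ValueError("stride_length_s must be smaller than chunk_length_s")

    # Short audio fits into a single chunk.
    if num_samples <= chunk:
        return [(0, num_samples)]

    # One full first chunk, then enough steps to cover the remainder.
    num_chunks = 1 + math.ceil((num_samples - chunk) / step)
    return [
        (i * step, min(num_samples, i * step + chunk))
        for i in range(num_chunks)
    ]


if __name__ == "__main__":
    # 70 seconds of 16 kHz audio -> three chunks: 0-30 s, 29-59 s, 58-70 s.
    for start, end in chunk_bounds(70 * 16_000):
        print(start / 16_000, end / 16_000)
```

Setting `stride_length_s=0` gives back-to-back 30-second segments with no overlap; a small overlap reduces the chance of cutting a word in half at a boundary, at the cost of possibly repeating words when the pieces are merged.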

💻 Usage Examples

Basic Usage

import librosa
import numpy as np
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

class Transcriber:
    def __init__(
            self,
            model_path="abilmansplus/whisper-turbo-ksc2",
            device="cuda:0",
            sampling_rate=16_000, 
            language="kazakh",  # use None if the audio is not always Kazakh; Kazakh is still recognized well
            task="transcribe",
            num_beams=5,
            chunk_length_s=30,  # chunk duration (seconds)
            stride_length_s=1  # overlap (seconds) between chunks
        ):
        self.processor = WhisperProcessor.from_pretrained(
            model_path,
            language=language, 
            task=task
        )
        self.model = WhisperForConditionalGeneration.from_pretrained(model_path)
        self.model = self.model.to(device)
        self.sr = sampling_rate
        self.language = language  # None or "kazakh"; either works with this model
        self.task = task
        self.num_beams=num_beams
        self.chunk_length_s = chunk_length_s  # chunk length in seconds
        self.stride_length_s = stride_length_s  # overlap between chunks in seconds   
    
    def transcribe(self, audio_path: str) -> str:
        """Transcribe the audio chunk by chunk and merge the results.

        Args:
            audio_path (str): path to the audio file to transcribe.

        Returns:
            str: transcription of the entire audio.
        """
        speech_array, sampling_rate = librosa.load(audio_path, sr=self.sr)
        audio_length_s = len(speech_array) / self.sr
        
        # If the audio is no longer than chunk_length_s, transcribe it in one pass
        if audio_length_s <= self.chunk_length_s:
            full_transcription = self._transcribe_chunk(speech_array)
            return full_transcription
        
        # For longer audio, process in chunks
        chunk_length_samples = int(self.chunk_length_s * self.sr)
        stride_length_samples = int(self.stride_length_s * self.sr)

        # Calculate number of chunks
        num_samples = len(speech_array)
        num_chunks = max(1, 
                         int(
                             1 +
                             np.ceil(
                                     (num_samples - chunk_length_samples) / 
                                     (chunk_length_samples - stride_length_samples)
                                    ) 
                            )
                        )

        transcriptions = []

        for i in range(num_chunks):
            # Calculate chunk start and end
            start = max(0, i * (chunk_length_samples - stride_length_samples))
            end = min(num_samples, start + chunk_length_samples)
            
            # Get audio chunk
            chunk = speech_array[start:end]
            
            # Transcribe chunk
            chunk_transcription = self._transcribe_chunk(chunk)
            transcriptions.append(chunk_transcription)
        
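        # Combine transcriptions; words inside the overlapping stride may
        # appear twice at chunk boundaries because chunks are simply joined
        full_transcription = " ".join(transcriptions)
        return full_transcription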

    def _transcribe_chunk(self, audio_chunk) -> str:
        # Process inputs
        inputs = self.processor(
            audio_chunk, 
            sampling_rate=self.sr, 
            return_tensors="pt"
        ).input_features.to(self.model.device)
        
        # Get forced decoder IDs for language and task (recent transformers
        # releases prefer passing language= and task= to generate() instead)
        forced_decoder_ids = self.processor.get_decoder_prompt_ids(
            language=self.language, 
            task=self.task
        )

        # Whisper input features have shape (batch, n_mels, n_frames); the
        # attention mask covers the frame axis, so it is (batch, n_frames)
        attention_mask = torch.ones_like(inputs[:, 0, :], dtype=torch.long)
        
        # Generate transcription
        with torch.no_grad():
            generated_ids = self.model.generate(
                inputs, 
                forced_decoder_ids=forced_decoder_ids,
                max_length=448,
                num_beams=self.num_beams,
                attention_mask=attention_mask,
            )
        
        # Decode the generated IDs to text
        transcription = self.processor.batch_decode(
            generated_ids, 
            skip_special_tokens=True
        )[0]
        
        return transcription
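
Because neighbouring chunks overlap by `stride_length_s`, joining the chunk transcriptions with a space can repeat a few words at each boundary. The sketch below shows one possible way to reduce that: drop the longest run of words at the start of a chunk that exactly matches the end of the text merged so far. The helper `merge_transcripts` and its `max_overlap_words` parameter are hypothetical and not part of the original model card. Exact matching is a simplifying assumption: it misses overlaps where a word was cut in half and transcribed differently, and it can remove a genuine repetition that happens to fall on a boundary.

```python
def merge_transcripts(chunks, max_overlap_words=10):
    """Join chunk transcriptions, removing exact word overlaps at boundaries."""
    merged = []
    for text in chunks:
        words = text.split()
        # Never look for an overlap longer than either side can provide.
        limit = min(max_overlap_words, len(merged), len(words))
        overlap = 0
        # Try the longest candidate first so the largest match wins.
        for k in range(limit, 0, -1):
            if merged[-k:] == words[:k]:
                overlap = k
                break
        merged.extend(words[overlap:])
    return " ".join(merged)


if __name__ == "__main__":
    pieces = ["a b c d", "c d e f", "g h"]
    print(merge_transcripts(pieces))
```

To try it with the class above, `transcribe` could collect the per-chunk strings and pass them to this helper in place of the plain `" ".join(...)`.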