Open-source whisper-turbo-ksc2 automatic speech recognition model - accurate Kazakh speech recognition

Whisper Turbo Ksc2

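Developed by abilmansplus

An automatic speech recognition model based on Whisper large-v3-turbo and fine-tuned on about 1,000 hours of Kazakh speech data. It reaches a word error rate (WER) of 9.16% on the test set.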

Speech recognition

Transformers

License: MIT #Kazakh speech recognition #low-WER transcription #chunked long-audio processing

Downloads: 1,740

Released: 5/1/2025

Model overview

A speech recognition model optimized specifically for Kazakh that transcribes Kazakh speech to text.

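Model features

High-accuracy Kazakh recognition

Fine-tuned on about 1,000 hours of Kazakh data, with a test-set WER of 9.16%

Long-audio handling

Transcribes audio longer than 30 seconds by processing it in chunks

Built on Whisper

Fine-tuned from Whisper large-v3-turbo, so it keeps the base model's architecture and interface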

Model capabilities

Kazakh speech recognition

Long-audio transcription

High-quality speech-to-text

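Use cases

Speech transcription

Kazakh meeting notes

Automatically transcribe the content of Kazakh-language meetings

Accuracy: 90.84% (this equals 100% minus the 9.16% test-set WER)

Subtitle generation for media content

Automatically generate subtitles for Kazakh-language video content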

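🚀 Whisper model fine-tuned on the Kazakh Speech Corpus

This project is an automatic speech recognition model based on Whisper large-v3-turbo and fine-tuned on Kazakh Speech Corpus 2 (about 1,000 hours of transcribed audio from various sources). After training on the training split, the model achieves a word error rate (WER) of 9.16% on the test split.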
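For readers unfamiliar with the metric, WER is conventionally the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The sketch below illustrates that definition only. The function name `word_error_rate` and the sample sentences are assumptions made for illustration, and the source does not say which tool or text normalization produced the 9.16% figure.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing from output)
                curr[j - 1] + 1,     # insertion (extra word in output)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# Illustrative strings only: one substituted word out of four reference words.
print(word_error_rate("бүгін ауа райы жақсы", "бүгін ауа райы жаксы"))  # 0.25
```

Under this definition, a WER of 9.16% corresponds to roughly nine word-level errors per hundred reference words.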

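🚀 Quick start

Model information

Property	Details
Model type	Automatic speech recognition model fine-tuned from Whisper large-v3-turbo
Training data	Kazakh Speech Corpus 2 (issai/Kazakh_Speech_Corpus_2)
Evaluation metric	Word error rate (WER); 9.16% on the test set
Base model	openai/whisper-large-v3-turbo
Library	transformers
License	MIT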

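Long-audio processing recommendation

⚠️ Important

For longer audio (over 35 seconds), split it into 30-second segments, transcribe each segment separately, and then merge the results.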
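The splitting step can be sketched as a pure boundary calculation, independent of the model. The helper name `chunk_bounds` is hypothetical, and the 1-second overlap between neighbouring segments is an assumption that mirrors the `stride_length_s` default in the usage example on this page; set it to 0 for strictly disjoint 30-second segments.

```python
import math


def chunk_bounds(num_samples, sampling_rate=16_000, chunk_length_s=30, stride_length_s=1):
    """Return (start, end) sample indices for overlapping fixed-length chunks."""
    chunk = int(chunk_length_s * sampling_rate)     # samples per chunk
    overlap = int(stride_length_s * sampling_rate)  # samples shared by neighbours
    if num_samples <= chunk:
        return [(0, num_samples)]                   # short audio: a single chunk
    step = chunk - overlap                          # distance between chunk starts
    num_chunks = 1 + math.ceil((num_samples - chunk) / step)
    return [
        (i * step, min(num_samples, i * step + chunk))
        for i in range(num_chunks)
    ]


# A 70-second clip at 16 kHz yields three chunks covering 0-30 s, 29-59 s and 58-70 s.
print(chunk_bounds(70 * 16_000))
# [(0, 480000), (464000, 944000), (928000, 1120000)]
```

Each (start, end) pair can then be used to slice the loaded waveform before it is passed to the model.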

💻 Usage example

Basic usage

import librosa
import numpy as np
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

class Transcriber:
    def __init__(
            self,
            model_path="abilmansplus/whisper-turbo-ksc2",
            device="cuda:0" if torch.cuda.is_available() else "cpu",  # fall back to CPU when no GPU is present
            sampling_rate=16_000, 
            language="kazakh",  # set to None if audio is not always in Kazakh, it will still do well on Kazakh
            task="transcribe",
            num_beams=5,
            chunk_length_s=30,  # chunk duration (seconds)
            stride_length_s=1  # overlap (seconds) between chunks
        ):
        self.processor = WhisperProcessor.from_pretrained(
            model_path,
            language=language, 
            task=task
        )
        self.model = WhisperForConditionalGeneration.from_pretrained(model_path)
        self.model = self.model.to(device)
        self.sr = sampling_rate
        self.language=language  # language can be None or "kazakh", any of those will work with this model
        self.task = task
        self.num_beams=num_beams
        self.chunk_length_s = chunk_length_s  # chunk length in seconds
        self.stride_length_s = stride_length_s  # overlap between chunks in seconds   
    
    def transcribe(self, audio_path: str) -> str:
        """transcribes the audio chunk by chunk and merges the results
        Args:
            audio_path (str): path to the audio to be transcribed
        Returns:
            full_transcription (str): transcription of the entire audio 
        """
        speech_array, sampling_rate = librosa.load(audio_path, sr=self.sr)
        audio_length_s = len(speech_array) / self.sr
        
        # If the audio fits in a single chunk, transcribe it directly
        if audio_length_s <= self.chunk_length_s:
            full_transcription = self._transcribe_chunk(speech_array)
            return full_transcription
        
        # For longer audio, process in chunks
        chunk_length_samples = int(self.chunk_length_s * self.sr)
        stride_length_samples = int(self.stride_length_s * self.sr)

        # Calculate number of chunks
        num_samples = len(speech_array)
        num_chunks = max(1, 
                         int(
                             1 +
                             np.ceil(
                                     (num_samples - chunk_length_samples) / 
                                     (chunk_length_samples - stride_length_samples)
                                    ) 
                            )
                        )

        transcriptions = []

        for i in range(num_chunks):
            # Calculate chunk start and end
            start = max(0, i * (chunk_length_samples - stride_length_samples))
            end = min(num_samples, start + chunk_length_samples)
            
            # Get audio chunk
            chunk = speech_array[start:end]
            
            # Transcribe chunk
            chunk_transcription = self._transcribe_chunk(chunk)
            transcriptions.append(chunk_transcription)
        
        # Combine transcriptions
        full_transcription = " ".join(transcriptions)
        return full_transcription        

    def _transcribe_chunk(self, audio_chunk) -> str:
        # Process inputs
        inputs = self.processor(
            audio_chunk, 
            sampling_rate=self.sr, 
            return_tensors="pt"
        ).input_features.to(self.model.device)
        
        # Get forced decoder IDs for language and task
        forced_decoder_ids = self.processor.get_decoder_prompt_ids(
            language=self.language, 
            task=self.task
        )

        # The processor pads every input to a fixed number of frames, so the mask is
        # all ones with shape (batch, frames); inputs has shape (batch, n_mels, frames)
        attention_mask = torch.ones_like(inputs[:, 0, :])
        
        # Generate transcription
        with torch.no_grad():
            generated_ids = self.model.generate(
                inputs, 
                forced_decoder_ids=forced_decoder_ids,
                max_length=448,
                num_beams=self.num_beams,
                attention_mask=attention_mask,
            )
        
        # Decode the generated IDs to text
        transcription = self.processor.batch_decode(
            generated_ids, 
            skip_special_tokens=True
        )[0]
        
        return transcription
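
The `transcribe` method above joins the chunk outputs with a single space. Because consecutive chunks overlap by `stride_length_s`, words spoken inside the overlap may be transcribed twice, once at the end of one chunk and once at the start of the next. The following is a minimal sketch of one possible post-processing step, not part of the original model card: the helper `merge_transcripts` and its `max_overlap_words` parameter are hypothetical, and the exact-match heuristic is an assumption that will miss duplicates when the two chunks spell a boundary word differently.

```python
def merge_transcripts(chunks, max_overlap_words=5):
    """Join chunk transcripts, dropping words repeated across a chunk boundary."""
    merged = []
    for text in chunks:
        words = text.split()
        # Look for the longest run (up to max_overlap_words) where the tail of
        # the merged text equals the head of the new chunk.
        limit = min(max_overlap_words, len(merged), len(words))
        overlap = 0
        for k in range(limit, 0, -1):
            if merged[-k:] == words[:k]:
                overlap = k
                break
        merged.extend(words[overlap:])
    return " ".join(merged)


print(merge_transcripts(["a b c d", "c d e"]))  # a b c d e
```

A caveat of this heuristic is that it also removes genuine repetitions that happen to fall exactly on a chunk boundary, so it is worth checking against a few real recordings before relying on it.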