Open-source whisper-turbo-ksc2 automatic speech recognition model: high-accuracy recognition of Kazakh speech

Whisper Turbo Ksc2

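Developed by abilmansplus

An automatic speech recognition model based on Whisper large-v3-turbo and fine-tuned on about 1,000 hours of Kazakh speech data. It reaches a word error rate (WER) of 9.16% on the test set.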

Speech recognition

Transformers

Open-source license: MIT #Kazakh speech recognition #Low-WER transcription #Chunked processing of long audio

Downloads: 1,740

Release date: 5/1/2025

Model overview

A speech recognition model optimized specifically for Kazakh that accurately transcribes Kazakh speech.

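Model features

High-accuracy Kazakh recognition

Fine-tuned on about 1,000 hours of Kazakh data; the word error rate (WER) on the test set is 9.16%.

Long-audio processing

Chunked processing supports transcription of audio longer than 30 seconds.

Optimized from Whisper

Fine-tuned from the Whisper large-v3-turbo model and inherits the base model's capabilities.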

Model capabilities

Kazakh speech recognition

Long-audio transcription

High-quality speech-to-text conversion

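Use cases

Speech transcription

Kazakh meeting records

Automatically transcribes Kazakh-language meetings.

About 90.84% word accuracy (100% minus the 9.16% test-set WER)

Subtitle generation for media content

Automatically generates subtitles for Kazakh-language video content.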

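🚀 Whisper model fine-tuned on the Kazakh Speech Corpus

This is an automatic speech recognition model based on Whisper large-v3-turbo and fine-tuned on the Kazakh Speech Corpus 2 (about 1,000 hours of transcribed audio from varied sources). After training on the training split, the model reaches **a word error rate (WER) of 9.16%** on the test split.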
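To make the metric concrete: WER is conventionally the word-level edit distance (substitutions, deletions and insertions) between the reference transcript and the model output, divided by the number of reference words. The sketch below is only an illustration of that definition, not the evaluation script used for this model. The function name `word_error_rate` and the sample strings are invented for the example, and no text normalization (casing, punctuation) is applied, although real evaluations often normalize first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # (previous row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substitution and one deletion over 4 reference words.
print(word_error_rate("a b c d", "a x c"))  # 0.5
```

Under this definition, a WER of 9.16% corresponds to roughly 9 word-level errors per 100 reference words.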

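🚀 Quick start

Model information

Attribute	Details
Model type	Automatic speech recognition model fine-tuned from Whisper large-v3-turbo
Training data	Kazakh Speech Corpus 2 (issai/Kazakh_Speech_Corpus_2)
Evaluation metric	Word error rate (WER): 9.16% on the test set
Base model	openai/whisper-large-v3-turbo
Library name	transformers
License	MIT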

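Recommendations for processing long audio

⚠️ Important note

For long audio (longer than 30 seconds), you can split it into 30-second chunks, transcribe each chunk separately, and join the results.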
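As a rough illustration of the splitting step, the sketch below computes the sample ranges of overlapping chunks using the same arithmetic as the `transcribe` method in the usage example (30-second chunks, 1-second overlap, 16 kHz audio). The helper name `chunk_bounds` is hypothetical and not part of the model or of transformers. It only computes boundaries and does no transcription.

```python
import math


def chunk_bounds(
    num_samples: int,
    chunk_length_s: float = 30,
    stride_length_s: float = 1,
    sampling_rate: int = 16_000,
) -> list[tuple[int, int]]:
    """Return (start, end) sample indices of overlapping chunks."""
    chunk = int(chunk_length_s * sampling_rate)
    overlap = int(stride_length_s * sampling_rate)
    step = chunk - overlap  # how far each new chunk advances
    if step <= 0:
        raise ValueError("stride_length_s must be smaller than chunk_length_s")

    # Short audio fits in a single chunk.
    if num_samples <= chunk:
        return [(0, num_samples)]

    num_chunks = 1 + math.ceil((num_samples - chunk) / step)
    bounds = []
    for i in range(num_chunks):
        start = i * step
        end = min(num_samples, start + chunk)  # last chunk may be shorter
        bounds.append((start, end))
    return bounds


# A 70-second file at 16 kHz.
for start, end in chunk_bounds(70 * 16_000):
    print(start / 16_000, end / 16_000)
```

For a 70-second file this yields three chunks covering 0-30 s, 29-59 s and 58-70 s, so each boundary second is seen by two consecutive chunks.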

💻 Usage examples

Basic usage

import librosa
import numpy as np
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

class Transcriber:
    def __init__(
            self,
            model_path="abilmansplus/whisper-turbo-ksc2",
            device="cuda:0",
            sampling_rate=16_000, 
            language="kazakh",  # set to None if audio is not always in Kazakh, it will still do well on Kazakh
            task="transcribe",
            num_beams=5,
            chunk_length_s=30,  # chunk duration (seconds)
            stride_length_s=1  # overlap (seconds) between chunks
        ):
        self.processor = WhisperProcessor.from_pretrained(
            model_path,
            language=language, 
            task=task
        )
        self.model = WhisperForConditionalGeneration.from_pretrained(model_path)
        self.model = self.model.to(device)
        self.sr = sampling_rate
        self.language=language  # language can be None or "kazakh", any of those will work with this model
        self.task = task
        self.num_beams=num_beams
        self.chunk_length_s = chunk_length_s  # chunk length in seconds
        self.stride_length_s = stride_length_s  # overlap between chunks in seconds   
    
    def transcribe(self, audio_path: str) -> str:
        """transcribes the audio chunk by chunk and merges the results
        Args:
            audio_path (str): path to the audio to be transcribed
        Returns:
            full_transcription (str): transcription of the entire audio 
        """
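        # librosa resamples to self.sr and downmixes to mono by default;
        # the returned sampling rate equals self.sr, so it is not needed here
        speech_array, _ = librosa.load(audio_path, sr=self.sr)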
        audio_length_s = len(speech_array) / self.sr
        
        # If audio is shorter than chunk_length_s, process normally
        if audio_length_s <= self.chunk_length_s:
            full_transcription = self._transcribe_chunk(speech_array)
            return full_transcription
        
        # For longer audio, process in chunks
        chunk_length_samples = int(self.chunk_length_s * self.sr)
        stride_length_samples = int(self.stride_length_s * self.sr)

        # Calculate number of chunks
        num_samples = len(speech_array)
        num_chunks = max(1, 
                         int(
                             1 +
                             np.ceil(
                                     (num_samples - chunk_length_samples) / 
                                     (chunk_length_samples - stride_length_samples)
                                    ) 
                            )
                        )

        transcriptions = []

        for i in range(num_chunks):
            # Calculate chunk start and end
            start = max(0, i * (chunk_length_samples - stride_length_samples))
            end = min(num_samples, start + chunk_length_samples)
            
            # Get audio chunk
            chunk = speech_array[start:end]
            
            # Transcribe chunk
            chunk_transcription = self._transcribe_chunk(chunk)
            transcriptions.append(chunk_transcription)
        
        # Combine transcriptions
        full_transcription = " ".join(transcriptions)
        return full_transcription        

    def _transcribe_chunk(self, audio_chunk) -> str:
        # Process inputs
        inputs = self.processor(
            audio_chunk, 
            sampling_rate=self.sr, 
            return_tensors="pt"
        ).input_features.to(self.model.device)
        
        # Get forced decoder IDs for language and task
        forced_decoder_ids = self.processor.get_decoder_prompt_ids(
            language=self.language, 
            task=self.task
        )

        # The attention mask should be 1 for all positions in the input features
        attention_mask = torch.ones_like(inputs[:, :, 0])
        
        # Generate transcription
        with torch.no_grad():
            generated_ids = self.model.generate(
                inputs, 
                forced_decoder_ids=forced_decoder_ids,
                max_length=448,
                num_beams=self.num_beams,
                attention_mask=attention_mask,
            )
        
        # Decode the generated IDs to text
        transcription = self.processor.batch_decode(
            generated_ids, 
            skip_special_tokens=True
        )[0]
        
        return transcription

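Advanced usage

The code above implements a transcriber that handles both short and long audio. Audio longer than chunk_length_s is split into chunks of that length, with consecutive chunks overlapping by stride_length_s seconds, and each chunk is transcribed separately. To improve results on long audio, adjust the chunk_length_s and stride_length_s parameters as needed. Keep chunk_length_s at or below 30, because Whisper processes audio in 30-second windows and the feature extractor truncates longer input by default.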
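One consequence of the overlap is worth noting: the `Transcriber` joins chunk transcripts with `" ".join(transcriptions)`, so words spoken inside the overlapping second can appear twice in the output. The sketch below shows one possible way to drop exact duplicates at chunk boundaries. It is a swapped-in heuristic, not part of the original model card; the function name `merge_with_overlap` and the `max_overlap_words` parameter are hypothetical. It only removes word sequences that match exactly, so it will miss boundaries where the two chunks transcribed the overlap differently, and it can wrongly remove a phrase the speaker genuinely repeated.

```python
def merge_with_overlap(transcriptions: list[str], max_overlap_words: int = 10) -> str:
    """Join chunk transcripts, dropping words duplicated at chunk boundaries."""
    merged: list[str] = []
    for text in transcriptions:
        words = text.split()
        # Look for the longest suffix of the merged text that is also
        # a prefix of the new chunk, up to max_overlap_words words.
        limit = min(max_overlap_words, len(merged), len(words))
        overlap = 0
        for k in range(limit, 0, -1):
            if merged[-k:] == words[:k]:
                overlap = k
                break
        merged.extend(words[overlap:])
    return " ".join(merged)


# Hypothetical chunk outputs with repeated words at the boundaries.
print(merge_with_overlap(["a b c d", "c d e f", "f g"]))  # a b c d e f g
```

If this approach were used, it would replace the final `" ".join(transcriptions)` call. With a 1-second stride, only a few words typically fall in the overlap, so a small `max_overlap_words` is likely enough.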