distil-whisper-small-cantonese: Open-Source Cantonese Speech Recognition Model - Free, High-Accuracy Cantonese Speech-to-Text

Distil Whisper Small Cantonese

Developed by alvanlii

A distilled Cantonese speech recognition model based on Whisper Small. It achieves a CER of 9.7 (punctuation removed) on Common Voice 16.0.

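Speech Recognition

Transformers

Language: Chinese. Open-source license: Apache-2.0. Tags: #Cantonese-speech-recognition #lightweight-model #low-resource-inference

Downloads: 187

Release date: 4/3/2024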

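Model Overview

This model is a distilled version of Whisper Small, optimized specifically for Cantonese speech recognition. It is smaller and runs inference faster than the model it was distilled from.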

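Model Features

Efficient inference

Inference takes roughly half the time of the Whisper Small model it was distilled from (0.027 s vs. 0.055 s per sample on GPU with SDPA, per the comparison table), and the model needs only about 2 GB of GPU VRAM.

Cantonese optimization

Trained and optimized specifically for Cantonese speech recognition.

Lightweight

The model is compressed by reducing the number of decoder layers, which cuts the parameter count from 242M to 157M.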
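As a rough sanity check on the 242M to 157M figure, the sketch below estimates how many weights disappear when the decoder goes from 12 layers to 3. It assumes Whisper small's published dimensions (model width 768, feed-forward width 3072) and counts only the large weight matrices, ignoring biases and layer norms, so it is an approximation rather than an exact parameter count. The function name `decoder_layer_params` is illustrative.

```python
# Back-of-the-envelope parameter count for the removed decoder layers.
# Assumption: Whisper small uses d_model=768 and an FFN width of 4 * d_model.
# Biases and layer norms are ignored; they add well under 1% per layer.
D_MODEL = 768
FFN_DIM = 4 * D_MODEL  # 3072


def decoder_layer_params(d_model: int = D_MODEL, ffn_dim: int = FFN_DIM) -> int:
    """Approximate weight count of one Transformer decoder layer."""
    self_attn = 4 * d_model * d_model     # q, k, v and output projections
    cross_attn = 4 * d_model * d_model    # the same four projections over encoder states
    feed_forward = 2 * d_model * ffn_dim  # up-projection and down-projection
    return self_attn + cross_attn + feed_forward


layers_removed = 12 - 3
removed = layers_removed * decoder_layer_params()

print(f"per layer: {decoder_layer_params() / 1e6:.2f}M")
print(f"removed:   {removed / 1e6:.1f}M")
print(f"estimate:  {242 - removed / 1e6:.0f}M")
```

Under these assumptions each decoder layer holds about 9.4M weights, so dropping nine of them removes about 85M, which is consistent with the reported reduction from 242M to 157M.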

Model Capabilities

Cantonese speech recognition

Speech-to-text conversion

Audio transcription

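Use Cases

Audio transcription

Cantonese meeting minutes

Automatically transcribe recordings of Cantonese meetings

Achieves a 9.7% character error rate (CER) on the Common Voice 16.0 test set

Media subtitle generation

Automatically generate subtitles for Cantonese video content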
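For the subtitle use case, one possible approach is to take timestamped transcription chunks and write them out in SRT format. The sketch below assumes the chunks arrive as dictionaries with a `timestamp` pair (start and end in seconds) and a `text` field, which is the shape the Transformers ASR pipeline returns when timestamps are requested. The helper names and the sample chunks are hypothetical, and the sketch assumes every chunk has both a start and an end time.

```python
def format_srt_time(seconds: float) -> str:
    """Render seconds as an SRT timestamp of the form HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def chunks_to_srt(chunks) -> str:
    """Convert [{'timestamp': (start, end), 'text': ...}, ...] into SRT text."""
    entries = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        entries.append(
            f"{index}\n"
            f"{format_srt_time(start)} --> {format_srt_time(end)}\n"
            f"{chunk['text'].strip()}\n"
        )
    # A blank line separates consecutive subtitle entries
    return "\n".join(entries)


# Hypothetical transcription output, for illustration only
example_chunks = [
    {"timestamp": (0.0, 2.5), "text": "大家好"},
    {"timestamp": (2.5, 5.04), "text": "歡迎收睇"},
]
print(chunks_to_srt(example_chunks))
```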

🚀 Distil-Whisper Small zh-HK - Alvin

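This model is a distilled version of alvanlii/whisper-small-cantonese, specialized for Cantonese. On Common Voice 16.0 it achieves a CER of 9.7 without punctuation and 11.59 with punctuation. It has 3 decoder layers instead of the 12 in the regular Whisper small model, and uses about 2 GB of GPU VRAM.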

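🚀 Quick Start

This is a distilled model specialized for Cantonese automatic speech recognition. The sections below describe its training data, its evaluation data, how it compares with the model it was distilled from, and how to use it.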

✨ Key Features

Distilled version of alvanlii/whisper-small-cantonese, specialized for Cantonese.
Achieves a CER of 9.7 (without punctuation) and 11.59 (with punctuation) on Common Voice 16.0.
Has 3 decoder layers instead of the usual 12 in the Whisper small model.
Uses about 2 GB of GPU VRAM.

📚 Documentation

Training and Evaluation Data

The following datasets were used for training:

CantoMap: Winterstein, Grégoire, Tang, Carmen and Lai, Regine (2020) "CantoMap: a Hong Kong Cantonese MapTask Corpus", in Proceedings of The 12th Language Resources and Evaluation Conference, Marseille: European Language Resources Association, p. 2899-2906.
Cantonese-ASR: Yu, Tiezheng, Frieske, Rita, Xu, Peng, Cahyawijaya, Samuel, Yiu, Cheuk Tung, Lovenia, Holy, Dai, Wenliang, Barezi, Elham, Chen, Qifeng, Ma, Xiaojuan, Shi, Bertram, Fung, Pascale (2022) "Automatic Speech Recognition Datasets in Cantonese: A Survey and New Dataset". Link: https://arxiv.org/pdf/2201.02419.pdf
The training splits of Common Voice yue and zh-HK

Evaluation used the test split of Common Voice 16.0 yue.
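To make the two reported CER figures concrete, here is a minimal sketch of how a character error rate can be computed with and without punctuation. It uses a plain Levenshtein distance over characters and strips Unicode punctuation and whitespace for the punctuation-free variant. The exact text normalization used for the reported numbers is not stated here, so the `strip_punctuation` rule is an assumption, and the sample sentences are made up for illustration.

```python
import unicodedata


def strip_punctuation(text: str) -> str:
    """Drop Unicode punctuation and whitespace so only spoken characters remain."""
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("P") and not ch.isspace()
    )


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance over characters, computed one row at a time."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str, keep_punctuation: bool = False) -> float:
    """Character error rate: edit distance divided by reference length.

    Assumes the reference is non-empty after normalization.
    """
    if not keep_punctuation:
        ref, hyp = strip_punctuation(ref), strip_punctuation(hyp)
    return edit_distance(ref, hyp) / len(ref)


# Made-up example: one wrong character, and the hypothesis lacks punctuation
reference = "你好，世界！"
hypothesis = "你好世間"
print(cer(reference, hypothesis))                         # 0.25
print(cer(reference, hypothesis, keep_punctuation=True))  # 0.5
```

In this example the punctuation-free score counts only the one substituted character, while the score with punctuation also penalizes the two missing marks. This is one reason a with-punctuation CER can be higher than the punctuation-free one.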

Comparison with Whisper Small

Metric	`alvanlii/distil-whisper-small-cantonese`	`alvanlii/whisper-small-cantonese`
CER (lower is better)	0.097	0.089
GPU inference time (sdpa) [s/sample]	0.027	0.055
GPU inference time (regular) [s/sample]	0.027	0.308
CPU inference time [s/sample]	1.3	2.57
Parameters [M]	157	242

Note: inference time is the average over the CV16 yue test set.
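The trade-off in the table is easier to read as ratios. The short sketch below recomputes them from the figures in the table; the dictionary layout and the `ratio` helper are illustrative only.

```python
# Figures copied from the comparison table, as (distilled, original) pairs
table = {
    "GPU (sdpa) s/sample": (0.027, 0.055),
    "GPU (regular) s/sample": (0.027, 0.308),
    "CPU s/sample": (1.3, 2.57),
    "Parameters M": (157, 242),
}


def ratio(distilled: float, original: float) -> float:
    """How many times smaller the distilled figure is than the original one."""
    return original / distilled


for name, (distilled, original) in table.items():
    print(f"{name}: {ratio(distilled, original):.2f}x")

# CER moves the other way: relative increase for the distilled model
cer_increase = (0.097 - 0.089) / 0.089
print(f"Relative CER increase: {cer_increase:.1%}")
```

By these figures the distilled model is about 2x faster on GPU with SDPA and on CPU, about 11x faster on GPU with regular attention, and about 1.5x smaller, at the cost of a relative CER increase of about 9%.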

💻 Usage Examples

Basic Usage

import librosa
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Load the audio and resample it to 16 kHz, the rate Whisper expects
y, sr = librosa.load("audio.mp3", sr=16000)

MODEL_NAME = "alvanlii/distil-whisper-small-cantonese"

processor = WhisperProcessor.from_pretrained(MODEL_NAME)
model = WhisperForConditionalGeneration.from_pretrained(MODEL_NAME)

# Clear forced decoder prompts and token suppression, and disable the KV cache
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []
model.config.use_cache = False

# Convert the waveform into log-mel input features
processed_in = processor(y, sampling_rate=sr, return_tensors="pt")

# Generate token IDs; the dict output exposes the sequences and per-step scores
gout = model.generate(
    input_features=processed_in.input_features,
    output_scores=True,
    return_dict_in_generate=True,
)

# Decode the token IDs into text, dropping special tokens
transcription = processor.batch_decode(gout.sequences, skip_special_tokens=True)[0]
print(transcription)

Advanced Usage

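import torch
from transformers import pipeline

MODEL_NAME = "alvanlii/distil-whisper-small-cantonese"
lang = "zh"
# Use the first GPU when available, otherwise fall back to the CPU
device = "cuda:0" if torch.cuda.is_available() else "cpu"
# Placeholder path to the audio file to transcribe
file = "audio.mp3"

pipe = pipeline(
    task="automatic-speech-recognition",
    model=MODEL_NAME,
    chunk_length_s=30,  # split long audio into 30-second chunks
    device=device,
)
# Force the decoder prompt to Chinese transcription
pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(language=lang, task="transcribe")
text = pipe(file)["text"]
print(text)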
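The `chunk_length_s=30` setting lets the pipeline handle recordings longer than Whisper's 30-second input window by processing them in pieces. The sketch below illustrates the general idea of covering a long recording with overlapping fixed-length windows. It is a simplified illustration, not the pipeline's actual implementation: the `chunk_windows` helper and the 5-second overlap are assumptions chosen for the example.

```python
def chunk_windows(duration_s: float, chunk_s: float = 30.0, overlap_s: float = 5.0):
    """Return (start, end) windows covering duration_s.

    Adjacent windows share overlap_s seconds so that words cut at one
    boundary appear whole in the neighbouring window.
    """
    if not 0 <= overlap_s < chunk_s:
        raise ValueError("overlap_s must be non-negative and smaller than chunk_s")
    step = chunk_s - overlap_s
    windows = []
    start = 0.0
    while True:
        end = min(start + chunk_s, duration_s)
        windows.append((start, end))
        if end >= duration_s:
            break
        start += step
    return windows


# A 70-second recording is covered by three overlapping windows
print(chunk_windows(70.0))  # [(0.0, 30.0), (25.0, 55.0), (50.0, 70.0)]
```

Audio shorter than one chunk yields a single window, so the setting has no effect on short clips.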