whisper-small-cantonese: an open-source Cantonese speech recognition model that is free to deploy and recognizes Cantonese with high accuracy

Whisper Small Cantonese

Developed by alvanlii

A Cantonese speech recognition model fine-tuned from OpenAI Whisper-small, achieving a CER of 7.93% on the Common Voice 16.0 test set

Speech recognition

Transformers

Multilingual support | Open-source license: Apache-2.0 | #CantoneseSpeechRecognition #LowCER #FastInference

Downloads: 2,413

Release date: 12/8/2022

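Model overview

An automatic speech recognition model optimized for Cantonese that supports efficient and accurate Cantonese speech-to-text conversion

Model features

Optimized Cantonese recognition

Fine-tuned specifically for the characteristics of Cantonese, lowering the character error rate (CER) to 7.93%

Efficient inference

Supports accelerated attention (Flash Attention via PyTorch SDPA), processing a sample in as little as 0.055 seconds on an RTX 3090

Multiple formats

GGML and CT2 formats are provided, compatible with tools such as Whisper.cpp and WhisperX

Speculative decoding support

Can serve as the assistant model that accelerates inference for a larger model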

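Model capabilities

Cantonese speech recognition

Chinese speech recognition

Fast speech-to-text conversion

Long-audio processing (with chunking support)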

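Use cases

Speech transcription

Cantonese video subtitle generation

Automatically generates accurate subtitles for Cantonese video content

Recognition accuracy of CER 7.93%

Voice assistants

Building voice interaction applications with Cantonese support

Fast response (0.055 s/sample)

Speech analysis

Cantonese speech data analysis

Transcription and analysis of Cantonese audio content

Supports multiple Cantonese dataset formats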

🚀 Whisper Small Cantonese - Alvin

This model is a version of openai/whisper-small fine-tuned for Cantonese. On Common Voice 16.0 it achieves a CER of 7.93% without punctuation and 9.72% with punctuation.

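✨ Key features

High-accuracy automatic speech recognition for Cantonese.
Low CER under different conditions (for example, with and without punctuation).
Fast inference.

💻 Usage examples

Basic usage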

import librosa
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

MODEL_NAME = "alvanlii/whisper-small-cantonese"

# Load the audio and resample it to the 16 kHz rate that Whisper expects.
y, sr = librosa.load("audio.mp3", sr=16000)

processor = WhisperProcessor.from_pretrained(MODEL_NAME)
model = WhisperForConditionalGeneration.from_pretrained(MODEL_NAME)

# Convert the waveform to log-Mel input features.
processed_in = processor(y, sampling_rate=sr, return_tensors="pt")

# Generate token IDs; no gradients are needed for inference.
with torch.no_grad():
    gout = model.generate(
        input_features=processed_in.input_features,
        output_scores=True,
        return_dict_in_generate=True,
    )

# Decode the token IDs to text, dropping special tokens.
transcription = processor.batch_decode(gout.sequences, skip_special_tokens=True)[0]
print(transcription)

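Advanced usage

from transformers import pipeline

MODEL_NAME = "alvanlii/whisper-small-cantonese"
lang = "zh"
device = ...  # specify the device, for example "cuda:0" or "cpu"

# chunk_length_s=30 splits long audio into 30-second chunks.
pipe = pipeline(
    task="automatic-speech-recognition",
    model=MODEL_NAME,
    chunk_length_s=30,
    device=device,
)
# Force the decoder prompt to Chinese-language transcription.
pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(language=lang, task="transcribe")
# `file` is the path to the audio file to transcribe.
text = pipe(file)["text"]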
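
The `chunk_length_s=30` argument above makes the pipeline cut long audio into 30-second windows before transcription. The following is a minimal sketch of that idea, not the pipeline's actual implementation: the helper name `chunk_bounds` and the 5-second overlap are assumptions chosen for illustration.

```python
def chunk_bounds(n_samples, sr=16000, chunk_s=30.0, overlap_s=5.0):
    """Return (start, end) sample indices of overlapping windows.

    Hypothetical helper for illustration; the 5 s overlap is an assumed value.
    """
    chunk = int(chunk_s * sr)
    step = chunk - int(overlap_s * sr)
    if step <= 0:
        raise ValueError("overlap_s must be smaller than chunk_s")
    bounds = []
    start = 0
    while start < n_samples:
        end = min(start + chunk, n_samples)
        bounds.append((start, end))
        if end == n_samples:
            break  # the last window reached the end of the audio
        start += step
    return bounds


# 70 seconds of 16 kHz audio is covered by three overlapping windows.
for start, end in chunk_bounds(70 * 16000):
    print(f"{start / 16000:.0f}s to {end / 16000:.0f}s")
```

Each window would then be transcribed separately and the overlapping text merged; the merge step is left out of this sketch.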

📚 Documentation

Training and evaluation data

The following datasets were used for training.

CantoMap: Winterstein, Grégoire, Tang, Carmen and Lai, Regine (2020) "CantoMap: a Hong Kong Cantonese MapTask Corpus", in Proceedings of The 12th Language Resources and Evaluation Conference, Marseille: European Language Resources Association, pp. 2899-2906.
Cantonese-ASR: Yu, Tiezheng, Frieske, Rita, Xu, Peng, Cahyawijaya, Samuel, Yiu, Cheuk Tung, Lovenia, Holy, Dai, Wenliang, Barezi, Elham, Chen, Qifeng, Ma, Xiaojuan, Shi, Bertram and Fung, Pascale (2022) "Automatic Speech Recognition Datasets in Cantonese: A Survey and New Dataset". Link: https://arxiv.org/pdf/2201.02419.pdf

Name	Hours
Common Voice 16.0 zh-HK Train	138
Common Voice 16.0 yue Train	85
Common Voice 17.0 yue Train	178
Cantonese-ASR	72
CantoMap	23
Pseudo-Labelled YouTube Data	438
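
The table gives no total, so the short calculation below adds up the listed hours. It only restates the table's figures. It is a plain sum, which would double-count any overlap between the Common Voice 16.0 and 17.0 yue splits, and the variable names are illustrative.

```python
# Hours per training dataset, copied from the table above.
train_hours = {
    "Common Voice 16.0 zh-HK Train": 138,
    "Common Voice 16.0 yue Train": 85,
    "Common Voice 17.0 yue Train": 178,
    "Cantonese-ASR": 72,
    "CantoMap": 23,
    "Pseudo-Labelled YouTube Data": 438,
}

total_hours = sum(train_hours.values())
pseudo_share = train_hours["Pseudo-Labelled YouTube Data"] / total_hours

print(f"total: {total_hours} h")
print(f"pseudo-labelled share: {pseudo_share:.1%}")
```

On these figures, the pseudo-labelled YouTube data is the largest single source, at a little under half of the listed hours.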

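The Common Voice 16.0 yue test set was used for evaluation.

Results

CER (lower is better): 0.0972
- Improved from 0.1073 and 0.1581 in previous versions.
CER (punctuation removed): 0.0793
GPU inference with accelerated attention (SDPA; see the example below): 0.055 s/sample
- All GPU evaluations were run on an RTX 3090 GPU.
GPU inference: 0.308 s/sample
CPU inference: 2.57 s/sample
GPU VRAM: ~1.5 GB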
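
The two CER figures differ only in whether punctuation is counted. The source does not say which scoring tool or text normalization was used, so the following is a minimal sketch under assumptions: CER is taken as character-level edit distance divided by reference length, and "punctuation" is taken to mean any character in a Unicode punctuation category. The function names are hypothetical.

```python
import unicodedata


def strip_punctuation(text):
    """Drop Unicode punctuation (categories P*) and whitespace."""
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("P") and not ch.isspace()
    )


def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp, remove_punct=False):
    """Character error rate of one hypothesis against one reference."""
    if remove_punct:
        ref, hyp = strip_punctuation(ref), strip_punctuation(hyp)
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


reference = "你好，世界。"
hypothesis = "你好世界"
print(cer(reference, hypothesis))                     # punctuation counted
print(cer(reference, hypothesis, remove_punct=True))  # punctuation removed
```

In this example the hypothesis misses only the two punctuation marks, so the score falls to zero once punctuation is removed. A corpus-level score would normally sum edits and reference lengths over all utterances rather than average per-utterance values.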

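Speeding up the model

To enable accelerated attention, add attn_implementation="sdpa". This selects PyTorch's scaled dot-product attention (SDPA), which can dispatch to a Flash Attention kernel on supported GPUs; the separate flash-attn package is selected with attn_implementation="flash_attention_2" instead.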

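import torch
from transformers import AutoModelForSpeechSeq2Seq

# Assumed dtype choice (not given in the original): half precision on GPU.
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "alvanlii/whisper-small-cantonese",
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    attn_implementation="sdpa",  # PyTorch scaled dot-product attention
)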

With SDPA attention enabled, the processing time per sample drops from 0.308 s to 0.055 s, roughly a 5.6x speedup.
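
The source does not describe how the seconds-per-sample figures were measured. One way such a figure could be obtained is sketched below; the helper `seconds_per_sample` and its `warmup` parameter are hypothetical, and `transcribe` stands for any callable that processes one sample.

```python
import time


def seconds_per_sample(transcribe, samples, warmup=1):
    """Average wall-clock seconds that `transcribe` takes per sample.

    Hypothetical helper; `transcribe` is any callable taking one sample.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    for sample in samples[:warmup]:  # warm-up runs are not timed
        transcribe(sample)
    start = time.perf_counter()
    for sample in samples:
        transcribe(sample)
    elapsed = time.perf_counter() - start
    return elapsed / len(samples)


# Speedup implied by the two GPU figures quoted above.
speedup = 0.308 / 0.055
print(f"{speedup:.1f}x")
```

On a GPU the callable should block until its result is ready, for example by calling torch.cuda.synchronize() before returning, because CUDA kernels run asynchronously and the timer would otherwise stop too early.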

Speculative decoding

A larger model can be used as the main model, with alvanlii/whisper-small-cantonese as the assistant model, to speed up inference with almost no loss of accuracy.

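import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

# Assumed setup (not given in the original): GPU in half precision when available.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Main model: the larger Cantonese Whisper model whose output is kept.
model_id = "simonl0909/whisper-large-v2-cantonese"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    attn_implementation="sdpa",
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

# Assistant model: the small model that drafts tokens for the main model to verify.
assistant_model_id = "alvanlii/whisper-small-cantonese"

assistant_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    assistant_model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    attn_implementation="sdpa",
)

assistant_model.to(device)
...
# `inputs` holds the processor output for the audio to transcribe.
model.generate(**inputs, use_cache=True, assistant_model=assistant_model)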

On its own, simonl0909/whisper-large-v2-cantonese runs at 0.714 s/sample with a CER of 7.65. With speculative decoding using alvanlii/whisper-small-cantonese as the assistant, it runs at 0.137 s/sample with a CER of 7.67, roughly 5.2x faster.
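
To show why a small assistant can speed up a large model without changing much of its output, here is a toy sketch of greedy draft-and-verify decoding. It is not the transformers implementation: plain Python callables stand in for the two Whisper models, and every name in it (`speculative_decode`, `main_next`, `draft_next`, `TARGET`) is hypothetical.

```python
def greedy_decode(next_token, prompt, max_new, eos):
    """Plain greedy decoding: one model call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        tok = next_token(out)
        out.append(tok)
        if tok == eos:
            break
    return out


def speculative_decode(main_next, draft_next, prompt, max_new, eos, k=4):
    """Greedy draft-and-verify decoding; returns (tokens, verification rounds)."""
    out = list(prompt)
    limit = len(prompt) + max_new
    rounds = 0
    done = False
    while len(out) < limit and not done:
        rounds += 1
        # 1. The small draft model proposes up to k tokens.
        draft = []
        for _ in range(min(k, limit - len(out))):
            tok = draft_next(out + draft)
            draft.append(tok)
            if tok == eos:
                break
        # 2. The main model checks the proposals in order. A real system
        #    scores all drafted positions in one batched forward pass.
        for tok in draft:
            expected = main_next(out)
            out.append(expected)  # always keep the main model's token
            if expected == eos:
                done = True
                break
            if expected != tok:
                break  # first disagreement: discard the rest of the draft
    return out, rounds


TARGET = [5, 3, 8, 1, 9, 2, 0]  # toy output sequence; 0 plays the role of EOS


def main_next(seq):
    return TARGET[min(len(seq), len(TARGET) - 1)]


def draft_next(seq):
    # The toy draft agrees with the main model except at position 3.
    return 7 if len(seq) == 3 else main_next(seq)


tokens, rounds = speculative_decode(main_next, draft_next, [], max_new=10, eos=0)
print(tokens, rounds)
```

In this toy version the output matches plain greedy decoding with the main model, while the main model is consulted in 2 verification rounds instead of 7 sequential steps. The speedup in a real system depends on how often the assistant's drafts are accepted.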