whisper-th-medium-combined open-source model - free to use for Thai automatic speech recognition

Whisper Th Medium Combined

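Developed by biodatlab

A Thai automatic speech recognition model based on openai/whisper-medium and fine-tuned on augmented Thai datasets.

Speech recognition

Transformers

Open-source license: Apache-2.0 #Thai speech recognition #Low-WER transcription #Multi-dataset fine-tuning

Downloads: 4,167

Release date: 12/14/2022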

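Model overview

This is a Thai automatic speech recognition model based on openai/whisper-medium, fine-tuned on augmented versions of the mozilla-foundation/common_voice_13_0 Thai dataset, the google/fleurs dataset, and curated datasets.

Model features

High-accuracy Thai recognition

Achieves a word error rate (WER) of 7.42 on the common-voice-13 test set.

Multi-dataset fine-tuning

Fine-tuned on mozilla-foundation/common_voice_13_0, google/fleurs, and curated datasets.

Long audio support

Long recordings can be transcribed in chunks by setting chunk_length_s=30.

Model capabilities

Thai speech recognition

Long audio transcription

Use cases

Speech transcription

Converting Thai speech to text

Transcribes Thai audio files into text.

WER: 7.42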

🚀 Whisper Medium (Thai): Combined V3

This model is a fine-tuned version of openai/whisper-medium, trained on augmented versions of the mozilla-foundation/common_voice_13_0 Thai dataset, the google/fleurs dataset, and curated datasets. On the common-voice-13 test set it achieves the following result:

Word error rate (WER): 7.42 (using the Deepcut tokenizer)
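Thai is written without spaces between words, so a word-level error rate depends on how the text is segmented, which is why the tokenizer is named alongside the score. The sketch below shows how WER can be computed once the reference and the hypothesis are already split into tokens. It is an illustration, not the evaluation script behind the reported figure: the function name `word_error_rate` and the sample tokens are hypothetical, and the tokens are assumed to come from a word segmenter such as Deepcut.

```python
def word_error_rate(reference, hypothesis):
    """Token-level edit distance divided by the reference length.

    Both arguments are lists of tokens. For Thai they are assumed to be
    produced by a word segmenter beforehand (for example Deepcut).
    """
    if not reference:
        raise ValueError("reference must contain at least one token")

    # prev[j] is the edit distance between the reference tokens seen so far
    # and the first j hypothesis tokens.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_tok in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hypothesis, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference token missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis token)
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(reference)


# Hypothetical pre-tokenized example: one reference token is dropped.
reference = ["ฉัน", "ชอบ", "กิน", "ข้าว"]
hypothesis = ["ฉัน", "ชอบ", "ข้าว"]
print(word_error_rate(reference, hypothesis))
```

The function returns a fraction. A score such as 7.42 is presumably that fraction expressed as a percentage, aggregated over the whole test set.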

🚀 Quick start

Model description

The model can be used with the Hugging Face transformers library as follows.

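import torch
from transformers import pipeline

MODEL_NAME = "biodatlab/whisper-th-medium-combined"  # model ID on the Hugging Face Hub
lang = "th"  # Thai

# Use the first GPU if one is available, otherwise fall back to the CPU
device = 0 if torch.cuda.is_available() else "cpu"

pipe = pipeline(
    task="automatic-speech-recognition",
    model=MODEL_NAME,
    chunk_length_s=30,  # split long audio into 30-second chunks
    device=device,
)
# Force the decoder to transcribe in Thai instead of auto-detecting the language
pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(
    language=lang,
    task="transcribe",
)
text = pipe("audio.mp3")["text"]  # transcribe an audio file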
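To give a feel for what chunk_length_s=30 means for a long recording, the sketch below lists the time windows a chunked transcription might cover. It is an illustration only, not the transformers implementation: the helper name `chunk_windows` is hypothetical, and the overlap of one sixth of the chunk length on each side is an assumption made for this example.

```python
def chunk_windows(total_s, chunk_length_s=30.0, stride_length_s=None):
    """Return (start, end) windows in seconds that cover `total_s` of audio.

    Neighbouring windows overlap so that words cut at a boundary appear
    whole in at least one window. The default overlap (chunk_length_s / 6
    on each side) is an assumption for this sketch.
    """
    if stride_length_s is None:
        stride_length_s = chunk_length_s / 6
    step = chunk_length_s - 2 * stride_length_s
    if step <= 0:
        raise ValueError("stride is too large for the chunk length")

    windows = []
    start = 0.0
    while True:
        end = min(start + chunk_length_s, total_s)
        windows.append((start, end))
        if end >= total_s:
            break
        start += step
    return windows


# A hypothetical 70-second recording
for start, end in chunk_windows(70):
    print(f"{start:.0f}s to {end:.0f}s")
```

With these assumptions, each window starts 20 seconds after the previous one, so a 70-second file is covered by three overlapping windows.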

🔧 Technical details

Training hyperparameters

The following hyperparameters were used during training.

learning_rate: 1e-05
train_batch_size: 16
eval_batch_size: 16
seed: 42
optimizer: AdamW with betas=(0.9, 0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 500
training_steps: 10000
mixed_precision_training: Native AMP
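The learning rate, scheduler type, warmup steps, and training steps listed above together define how the learning rate changes over training. The sketch below reproduces the usual shape of a linear schedule with warmup: a linear ramp from zero to the peak, followed by a linear decay back to zero. The function name `linear_lr` is hypothetical, and the code illustrates the schedule shape rather than the training script that was actually used.

```python
def linear_lr(step, base_lr=1e-5, warmup_steps=500, total_steps=10000):
    """Learning rate at `step` for a linear schedule with linear warmup."""
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Decay: fall linearly from base_lr to 0 at total_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 250, 500, 5250, 10000):
    print(step, linear_lr(step))
```

Under these settings the rate peaks at 1e-05 at step 500 and is back at half that value by step 5250, the midpoint of the decay phase.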

Framework versions

Transformers 4.37.2
PyTorch 2.1.0
Datasets 2.16.1
Tokenizers 0.15.1
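When trying to reproduce the results, it can help to compare the local environment against the versions listed above. The following is a minimal sketch using the standard library; the helper names are hypothetical, and the package names (`transformers`, `torch`, `datasets`, `tokenizers`) are assumed to be the names the libraries are installed under.

```python
from importlib import metadata

# Versions listed in the model card
PINNED = {
    "transformers": "4.37.2",
    "torch": "2.1.0",
    "datasets": "2.16.1",
    "tokenizers": "0.15.1",
}


def installed_versions(names):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def find_mismatches(pinned, installed):
    """Return {name: (wanted, found)} for packages that differ from `pinned`.

    Local build suffixes such as "+cu121" are ignored when comparing.
    """
    mismatches = {}
    for name, wanted in pinned.items():
        found = installed.get(name)
        if found is None or found.split("+", 1)[0] != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


print(find_mismatches(PINNED, installed_versions(PINNED)))
```

A different version does not necessarily break inference; the check only makes differences from the training environment visible.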

📄 License

This model is released under the Apache-2.0 license.

📚 Documentation

Citation

To cite this model, use the following BibTeX entry.

@misc {thonburian_whisper_med,
    author       = { Atirut Boribalburephan and Zaw Htet Aung and Knot Pipatsrisawat and Titipat Achakulvisut },
    title        = { Thonburian Whisper: A fine-tuned Whisper model for Thai automatic speech recognition },
    year         = 2022,
    url          = { https://huggingface.co/biodatlab/whisper-th-medium-combined },
    doi          = { 10.57967/hf/0226 },
    publisher    = { Hugging Face }
}

Model information

Attribute	Details
Model type	Fine-tuned Whisper model for Thai automatic speech recognition
Training data	mozilla-foundation/common_voice_13_0, google/fleurs, and curated datasets