whisper-large-v2-japanese-5k-steps: Open-Source Speech Recognition Model

Whisper Large V2 Japanese 5k Steps

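Developed by clu-ling

A speech recognition model obtained by fine-tuning OpenAI's whisper-large-v2 on the Japanese CommonVoice dataset for 5,000 training steps. It reaches a word error rate (WER) of 0.7449 on the evaluation set.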

Speech recognition

Transformers

Japanese

Open-source license: Apache-2.0

#Japanese speech transcription #Whisper fine-tuning #Low-resource training

Downloads: 144

Release date: 1/28/2023

Model overview

A Whisper model fine-tuned for Japanese speech recognition, suited to transcribing Japanese audio.

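Model features

Japanese optimization

Fine-tuned specifically for Japanese speech recognition.

Lightweight fine-tuning

Trained for only 5,000 steps and intended for research use.

Whisper architecture

Built on OpenAI's Whisper-large-v2 model.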

Model capabilities

Japanese speech recognition

Speech-to-text conversion

Use cases

Speech transcription

Japanese speech to text

Transcribes spoken Japanese content into text.

Word error rate: 0.7449

🚀 whisper-large-v2-japanese-5k-steps

This model is a version of openai/whisper-large-v2 fine-tuned on the Japanese CommonVoice dataset (v11). It achieves the following results on the evaluation set.

Loss: 0.4200
Word error rate (WER): 0.7449
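
To make the WER figure easier to interpret, the following is a minimal sketch of a word-level error rate: the edit distance between reference and hypothesis word sequences, divided by the number of reference words. The function name `word_error_rate` is hypothetical, and whitespace splitting is an assumption about how words are delimited. It is not the exact implementation behind the reported score. Under that assumption, Japanese text written without spaces is treated as one long token, which may partly explain why the reported value is high.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words.

    Assumption: words are delimited by whitespace (hypothetical helper,
    not the scoring code used for the reported result).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of three reference words.
print(word_error_rate("the cat sat", "the dog sat"))

# Unsegmented Japanese: each sentence is a single whitespace-delimited token,
# so one differing character makes the whole sentence count as an error.
print(word_error_rate("今日は晴れです", "今日は晴れでした"))
```

A character-level error rate is often reported for Japanese for this reason; the card itself only reports WER.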

✨ Key features

This model was fine-tuned for 5,000 steps for research purposes. As a result, its transcriptions may not be accurate enough to satisfy end users.

📚 Documentation

Training and evaluation data

Training data: CommonVoice (v11) train split
Validation data: CommonVoice (v11) validation split
Test data: CommonVoice (v11) test split

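Training procedure

Training hyperparameters

The following hyperparameters were used during training.

Learning rate: 1e-05
Training batch size: 50
Evaluation batch size: 16
Seed: 42
Optimizer: Adam (betas=(0.9, 0.999), epsilon=1e-08)
Learning rate scheduler type: linear
Learning rate scheduler warmup steps: 500
Training steps: 5000
Mixed precision training: Native AMP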
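
The learning rate, warmup steps, and total steps listed above together define the learning rate at every step. The sketch below assumes that the "linear" scheduler has the common shape of a linear ramp from zero during warmup followed by a linear decay to zero at the final step. The function name `linear_schedule_lr` is hypothetical, and this is an illustration of that shape rather than the training code used for this model.

```python
def linear_schedule_lr(step: int,
                       base_lr: float = 1e-5,
                       warmup_steps: int = 500,
                       total_steps: int = 5000) -> float:
    """Learning rate at a given step for linear warmup plus linear decay.

    Defaults are the hyperparameters listed in this card; the schedule
    shape itself is an assumption about what "linear" means here.
    """
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr over the warmup steps.
        return base_lr * step / warmup_steps
    # Decay linearly from base_lr at the end of warmup to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


for step in (0, 250, 500, 2750, 5000):
    print(step, linear_schedule_lr(step))
```

Under these assumptions the peak rate of 1e-05 is reached at step 500, and the rate is back to half of the peak at step 2750, midway through the decay phase.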

Training results

Training loss	Epoch	Step	Validation loss	Word error rate (WER)
0.0111	7.63	1000	0.3210	0.7888
0.0007	15.27	2000	0.3585	0.7478
0.0003	22.9	3000	0.3937	0.7432
0.0002	30.53	4000	0.4123	0.7443
0.0002	38.17	5000	0.4200	0.7449
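
The table can be checked programmatically. The short sketch below copies the five rows as tuples and looks for the checkpoint with the lowest WER and for the trend in validation loss. The variable names are illustrative only, and only the reported checkpoints are considered.

```python
# (training_loss, epoch, step, validation_loss, wer), copied from the table above
results = [
    (0.0111, 7.63, 1000, 0.3210, 0.7888),
    (0.0007, 15.27, 2000, 0.3585, 0.7478),
    (0.0003, 22.9, 3000, 0.3937, 0.7432),
    (0.0002, 30.53, 4000, 0.4123, 0.7443),
    (0.0002, 38.17, 5000, 0.4200, 0.7449),
]

# Checkpoint with the lowest word error rate among the reported steps.
best = min(results, key=lambda row: row[4])
best_step, best_wer = best[2], best[4]
print("lowest WER:", best_wer, "at step", best_step)

# Does validation loss increase at every reported checkpoint?
val_losses = [row[3] for row in results]
val_loss_rising = all(a < b for a, b in zip(val_losses, val_losses[1:]))
print("validation loss rises at every checkpoint:", val_loss_rising)
```

Among the reported checkpoints, WER is lowest at step 3000 and changes little afterwards, while validation loss keeps rising as training loss approaches zero. That pattern is consistent with overfitting to the training split, although the card does not state this.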

💻 Usage examples

Basic usage

from datasets import load_dataset, Audio
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# load the model
processor = WhisperProcessor.from_pretrained("clu-ling/whisper-large-v2-japanese-5k-steps")
model = WhisperForConditionalGeneration.from_pretrained("clu-ling/whisper-large-v2-japanese-5k-steps").to(device)
forced_decoder_ids = processor.get_decoder_prompt_ids(language="ja", task="transcribe")

# load the dataset
commonvoice_eval = load_dataset("mozilla-foundation/common_voice_11_0", "ja", split="validation", streaming=True)
commonvoice_eval = commonvoice_eval.cast_column("audio", Audio(sampling_rate=16000))
sample = next(iter(commonvoice_eval))["audio"]

# features and generate token ids
input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features
predicted_ids = model.generate(input_features.to(device), forced_decoder_ids=forced_decoder_ids)

# decode
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

print(transcription)

Advanced usage

from datasets import load_dataset, Audio
import evaluate
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from transformers.models.whisper.english_normalizer import BasicTextNormalizer

# device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# metric
wer_metric = evaluate.load("wer")

# text normalizer applied to both the gold text and the predictions
whisper_norm = BasicTextNormalizer()

# model (moved to the device once, outside the per-example function)
processor = WhisperProcessor.from_pretrained("clu-ling/whisper-large-v2-japanese-5k-steps")
model = WhisperForConditionalGeneration.from_pretrained("clu-ling/whisper-large-v2-japanese-5k-steps").to(device)
forced_decoder_ids = processor.get_decoder_prompt_ids(language="ja", task="transcribe")

# dataset
dataset = load_dataset("mozilla-foundation/common_voice_11_0", "ja", split="test")  # cache_dir=args.cache_dir
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))

# for debugging: keep only a small shard of examples
# dataset = dataset.shard(num_shards=7000, index=0)
# print(dataset)

def normalize(batch):
    # normalize the reference sentence
    batch["gold_text"] = whisper_norm(batch["sentence"])
    return batch

def map_wer(batch):
    # transcribe one example and store the normalized prediction
    inputs = processor(batch["audio"]["array"], sampling_rate=batch["audio"]["sampling_rate"], return_tensors="pt").input_features
    with torch.no_grad():
        generated_ids = model.generate(inputs=inputs.to(device), forced_decoder_ids=forced_decoder_ids)
    transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    batch["predicted_text"] = whisper_norm(transcription)
    return batch

# process GOLD text
processed_dataset = dataset.map(normalize)
# get predictions
predicted = processed_dataset.map(map_wer)

# word error rate, reported as a percentage
wer = wer_metric.compute(references=predicted["gold_text"], predictions=predicted["predicted_text"])
wer = round(100 * wer, 2)
print("WER:", wer)