🚀 whisper-large-v2-japanese-5k-steps
This model is a fine-tuned version of openai/whisper-large-v2 on the Japanese CommonVoice dataset (v11). It achieves the following results on the evaluation set:
- Loss: 0.4200
- Wer: 0.7449
🚀 Quick Start
This model was fine-tuned for 5000 steps for research purposes, which means the transcriptions may not fully satisfy every user.
✨ Features
- Fine-tuned from openai/whisper-large-v2 for Japanese speech recognition.
- Trained and evaluated on the Japanese CommonVoice dataset (v11).
📦 Installation
The original documentation does not provide installation steps.
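The usage examples below do, however, import `transformers`, `datasets`, `evaluate`, and `torch`, so installing those packages (e.g. `pip install transformers datasets evaluate torch`) is an inferred prerequisite rather than something stated by the card.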
💻 Usage Examples
Basic Usage
The following code shows how to transcribe speech with the model:
```python
from datasets import load_dataset, Audio
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the processor and model, and pin the language/task prompt tokens
processor = WhisperProcessor.from_pretrained("clu-ling/whisper-large-v2-japanese-5k-steps")
model = WhisperForConditionalGeneration.from_pretrained("clu-ling/whisper-large-v2-japanese-5k-steps").to(device)
forced_decoder_ids = processor.get_decoder_prompt_ids(language="ja", task="transcribe")

# Stream one sample from the CommonVoice validation split, resampled to 16 kHz
commonvoice_eval = load_dataset("mozilla-foundation/common_voice_11_0", "ja", split="validation", streaming=True)
commonvoice_eval = commonvoice_eval.cast_column("audio", Audio(sampling_rate=16000))
sample = next(iter(commonvoice_eval))["audio"]

# Extract log-mel input features and generate the transcription
input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features
predicted_ids = model.generate(input_features.to(device), forced_decoder_ids=forced_decoder_ids)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription)
```
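For quick experiments, the same checkpoint can also be driven through the `pipeline` API. This is a minimal sketch rather than part of the original card, and `sample.wav` is a placeholder path:

```python
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="clu-ling/whisper-large-v2-japanese-5k-steps",
)
# Pin the language/task tokens, mirroring forced_decoder_ids above
asr.model.config.forced_decoder_ids = asr.tokenizer.get_decoder_prompt_ids(
    language="ja", task="transcribe"
)
print(asr("sample.wav")["text"])  # "sample.wav" is a placeholder audio file
```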
Advanced Usage
The following code shows how to evaluate the model on the mozilla-foundation/common_voice_11_0 test split:
```python
from transformers.models.whisper.english_normalizer import BasicTextNormalizer
from datasets import load_dataset, Audio
import evaluate
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Metric, text normalizer, processor, and model
wer_metric = evaluate.load("wer")
whisper_norm = BasicTextNormalizer()
processor = WhisperProcessor.from_pretrained("clu-ling/whisper-large-v2-japanese-5k-steps")
model = WhisperForConditionalGeneration.from_pretrained("clu-ling/whisper-large-v2-japanese-5k-steps").to(device)
forced_decoder_ids = processor.get_decoder_prompt_ids(language="ja", task="transcribe")

# Load the CommonVoice test split, resampled to 16 kHz
dataset = load_dataset("mozilla-foundation/common_voice_11_0", "ja", split="test")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))

def normalize(batch):
    # Normalize the reference text the same way as the predictions
    batch["gold_text"] = whisper_norm(batch["sentence"])
    return batch

def map_wer(batch):
    inputs = processor(batch["audio"]["array"], sampling_rate=batch["audio"]["sampling_rate"], return_tensors="pt").input_features
    with torch.no_grad():
        generated_ids = model.generate(inputs=inputs.to(device), forced_decoder_ids=forced_decoder_ids)
    transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    batch["predicted_text"] = whisper_norm(transcription)
    return batch

processed_dataset = dataset.map(normalize)
predicted = processed_dataset.map(map_wer)

wer = wer_metric.compute(references=predicted["gold_text"], predictions=predicted["predicted_text"])
wer = round(100 * wer, 2)
print("WER:", wer)
```
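Note that both references and predictions pass through `BasicTextNormalizer` before scoring, so the reported WER reflects normalized text (lowercased, punctuation stripped) rather than raw transcripts.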
📚 Documentation
Training and Evaluation Data
| Property | Details |
|----------|---------|
| Training data | CommonVoice (v11) train split |
| Validation data | CommonVoice (v11) validation split |
| Test data | CommonVoice (v11) test split |
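For reference, the three splits can be loaded with `datasets` exactly as in the usage examples above; this is a minimal sketch, and streaming or authentication options may be needed depending on your Hugging Face setup:

```python
from datasets import load_dataset

# CommonVoice 11 Japanese splits used for training, validation, and testing
train = load_dataset("mozilla-foundation/common_voice_11_0", "ja", split="train")
valid = load_dataset("mozilla-foundation/common_voice_11_0", "ja", split="validation")
test = load_dataset("mozilla-foundation/common_voice_11_0", "ja", split="test")
```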
Training Procedure
Training Hyperparameters
The following hyperparameters were used during training (a `Seq2SeqTrainingArguments` sketch follows the list):
- learning_rate: 1e-05
- train_batch_size: 50
- eval_batch_size: 16
- seed: 42
- optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- training_steps: 5000
- mixed_precision_training: Native AMP
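These settings map onto `Seq2SeqTrainingArguments` roughly as follows. This is a hedged reconstruction, since the original training script is not included in the card, and `output_dir` is a placeholder:

```python
from transformers import Seq2SeqTrainingArguments

# Approximate reconstruction of the reported hyperparameters;
# the Adam betas/epsilon and linear scheduler are Trainer defaults.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-v2-japanese-5k-steps",  # placeholder
    learning_rate=1e-5,
    per_device_train_batch_size=50,
    per_device_eval_batch_size=16,
    seed=42,
    warmup_steps=500,
    max_steps=5000,
    lr_scheduler_type="linear",
    fp16=True,  # "Native AMP" mixed precision
)
```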
Training Results
| Training Loss | Epoch | Step | Validation Loss | Wer    |
|:-------------:|:-----:|:----:|:---------------:|:------:|
| 0.0111        | 7.63  | 1000 | 0.3210          | 0.7888 |
| 0.0007        | 15.27 | 2000 | 0.3585          | 0.7478 |
| 0.0003        | 22.9  | 3000 | 0.3937          | 0.7432 |
| 0.0002        | 30.53 | 4000 | 0.4123          | 0.7443 |
| 0.0002        | 38.17 | 5000 | 0.4200          | 0.7449 |
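The final row (step 5000) corresponds to the released checkpoint and matches the headline evaluation figures above (loss 0.4200, Wer 0.7449).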
Framework Versions
- Transformers 4.26.0.dev0
- Pytorch 1.13.1
- Datasets 2.8.1.dev0
- Tokenizers 0.13.2
📄 License
This project is licensed under the Apache-2.0 license.