whisper-small-spanish: an open-source speech recognition model for accurate Spanish transcription, free to deploy

Whisper Small Spanish

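Developed by clu-ling

A speech recognition model fine-tuned from OpenAI's whisper-small on the Spanish portion of the Common Voice v11 dataset, focused on Spanish transcription.

Speech Recognition

Transformers

License: Apache-2.0 #SpanishSpeechRecognition #LowWordErrorRate #CommonVoiceFineTuning

Downloads: 298

Released: 12/14/2022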

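Model Overview

whisper-small-spanish is an automatic speech recognition (ASR) model optimized for Spanish. It transcribes Spanish speech into text.

Model Features

Optimized for Spanish

Fine-tuned specifically on Spanish speech to improve Spanish recognition over the base whisper-small model

Low Word Error Rate

Reaches a word error rate (WER) of 20.68% on the Common Voice evaluation set

Efficient Training

Trained with mixed precision and a linear learning-rate scheduler

Model Capabilities

Spanish speech recognition

Speech-to-text

Long-audio processing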

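Use Cases

Speech Transcription

Spanish meeting notes

Automatically transcribe recordings of Spanish-language meetings into written records

Roughly 80% word-level accuracy, inferred from the 20.68% WER on Common Voice; no figure is reported for meeting audio

Voice Assistants

Provide speech recognition for Spanish-language voice assistants

Education

Language-learning support

Help learners of Spanish check their pronunciation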

🚀 whisper-small-sp

This model is a fine-tuned version of openai/whisper-small on the Common Voice v11 dataset. It achieves the following results on the evaluation set:

Loss: 0.4485
Word error rate (WER): 20.6842
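
To make the WER figure concrete, here is a minimal sketch of how a word error rate can be computed: the word-level edit distance between reference and hypothesis, divided by the number of reference words. The function name `word_error_rate` and the example sentences are illustrative assumptions and are not part of the model card; the reported score itself comes from the `evaluate` library's `wer` metric used in the advanced example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: word-level edit distance / reference length.

    Hypothetical helper for illustration only.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# One dropped word out of six reference words
print(word_error_rate("el gato duerme en la casa", "el gato duerme en casa"))
```

Read this way, a WER of 20.68 means about one word-level error for every five reference words.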

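🚀 Quick Start

This model can be used for speech transcription. See the usage examples below.

✨ Key Features

A fine-tuned openai/whisper-small model, optimized on Spanish Common Voice data.
Training hyperparameters and training results are documented in detail.
Code examples for transcription and evaluation are included.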

💻 Usage Examples

Basic Usage

from datasets import load_dataset, Audio
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Use a GPU when available, otherwise fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the processor (feature extractor + tokenizer) and the fine-tuned model
processor = WhisperProcessor.from_pretrained("clu-ling/whisper-small-spanish")
model = WhisperForConditionalGeneration.from_pretrained("clu-ling/whisper-small-spanish").to(device)
# Prompt the decoder to transcribe in Spanish
forced_decoder_ids = processor.get_decoder_prompt_ids(language="es", task="transcribe")

# Stream the Common Voice 11 Spanish validation split and resample the audio to 16 kHz
commonvoice_eval = load_dataset("mozilla-foundation/common_voice_11_0", "es", split="validation", streaming=True)
commonvoice_eval = commonvoice_eval.cast_column("audio", Audio(sampling_rate=16000))
# Take the audio of the first example
sample = next(iter(commonvoice_eval))["audio"]

# Extract input features and generate token ids
input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features
predicted_ids = model.generate(input_features.to(device), forced_decoder_ids=forced_decoder_ids)

# Decode the token ids into text
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

print(transcription)
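
The basic example transcribes one short clip. Whisper's feature extractor pads or truncates each input to a 30-second window, so a longer recording has to be split before it is passed to the processor. The sketch below shows one simple way to do that with fixed-size, non-overlapping chunks. The helper `split_into_chunks` and its constants are hypothetical and not part of the model card.

```python
from typing import List, Sequence

SAMPLING_RATE = 16_000  # matches the Audio(sampling_rate=16000) cast above
CHUNK_SECONDS = 30      # length of Whisper's input window


def split_into_chunks(
    samples: Sequence[float],
    chunk_seconds: int = CHUNK_SECONDS,
    sampling_rate: int = SAMPLING_RATE,
) -> List[Sequence[float]]:
    """Split a waveform into consecutive chunks of at most chunk_seconds.

    Hypothetical helper for illustration; the last chunk may be shorter.
    """
    chunk_size = chunk_seconds * sampling_rate
    if chunk_size <= 0:
        raise ValueError("chunk_seconds and sampling_rate must be positive")
    return [
        samples[start:start + chunk_size]
        for start in range(0, len(samples), chunk_size)
    ]


# A 70-"second" toy signal at 1 sample per second splits into 30 + 30 + 10
toy_chunks = split_into_chunks(list(range(70)), chunk_seconds=30, sampling_rate=1)
print([len(chunk) for chunk in toy_chunks])
```

Each chunk could then be passed to the processor in place of `sample["array"]` and the resulting transcriptions joined. Cutting at fixed boundaries can split a word in two; the Transformers ASR pipeline's `chunk_length_s` option, which uses overlapping chunks, is likely a better fit for production use.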

Advanced Usage

from transformers.models.whisper.english_normalizer import BasicTextNormalizer
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from datasets import load_dataset, Audio
import evaluate
import torch

# device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# metric
wer_metric = evaluate.load("wer")

# model
processor = WhisperProcessor.from_pretrained("clu-ling/whisper-small-spanish")
model = WhisperForConditionalGeneration.from_pretrained("clu-ling/whisper-small-spanish")

# dataset
dataset = load_dataset("mozilla-foundation/common_voice_11_0", "es", split="test")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))

# For debugging: uncomment to evaluate on a small shard only
# dataset = dataset.shard(num_shards=10000, index=0)
# print(dataset)

# Text normalizer applied to both references and predictions before scoring
whisper_norm = BasicTextNormalizer()

# Move the model to the device and build the decoder prompt once,
# instead of repeating both for every example
model.to(device)
forced_decoder_ids = processor.get_decoder_prompt_ids(language="es", task="transcribe")

def normalize(batch):
  # Normalize the reference (gold) transcript
  batch["gold_text"] = whisper_norm(batch["sentence"])
  return batch

def map_wer(batch):
  # Transcribe one example and store the normalized prediction
  audio = batch["audio"]
  inputs = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt").input_features
  with torch.no_grad():
    generated_ids = model.generate(inputs=inputs.to(device), forced_decoder_ids=forced_decoder_ids)
  transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
  batch["predicted_text"] = whisper_norm(transcription)
  return batch

# process GOLD text
processed_dataset = dataset.map(normalize)
# get predictions
predicted = processed_dataset.map(map_wer)

# word error rate
wer = wer_metric.compute(references=predicted['gold_text'], predictions=predicted['predicted_text'])
wer = round(100 * wer, 2)
print("WER:", wer)
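
The evaluation script normalizes both the reference and the predicted text before scoring, so that differences in casing or punctuation are not counted as word errors. The following is a simplified approximation of that idea using only the standard library. It is not the library's `BasicTextNormalizer` implementation, and the name `simple_normalize` is an assumption made for illustration.

```python
import re
import unicodedata


def simple_normalize(text: str) -> str:
    """Lowercase, drop punctuation and symbols, and collapse whitespace.

    Simplified, hypothetical stand-in for a Whisper-style text normalizer.
    Accented letters such as "ó" are kept, which matters for Spanish.
    """
    # NFKC composes base letter + combining accent into a single character
    text = unicodedata.normalize("NFKC", text).lower()
    # Replace marks (M*), symbols (S*), and punctuation (P*) with spaces
    text = "".join(
        " " if unicodedata.category(ch)[0] in "MSP" else ch
        for ch in text
    )
    # Collapse runs of whitespace and trim the ends
    return re.sub(r"\s+", " ", text).strip()


print(simple_normalize("¡Hola, Mundo!"))
print(simple_normalize("¿Cómo  estás?"))
```

Without a step like this, "Hola," and "hola" would count as different words and inflate the WER.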

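🔧 Technical Details

Training Hyperparameters

The following hyperparameters were used during training:

Learning rate: 0.0005
Training batch size: 16
Evaluation batch size: 8
Random seed: 42
Optimizer: Adam (β1 = 0.9, β2 = 0.999, ε = 1e-08)
Learning-rate scheduler type: linear
Learning-rate scheduler warmup steps: 500
Training steps: 25000
Mixed-precision training: Native AMP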
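
As an illustration of how these scheduler settings interact, the sketch below computes the learning rate at a given step. It assumes the common behavior of a linear scheduler with warmup: a linear ramp from 0 to the peak rate over the warmup steps, then a linear decay to 0 at the final step. The source states only the scheduler type and the warmup length, so the exact shape is an assumption, and `linear_schedule_lr` is a hypothetical helper.

```python
PEAK_LR = 5e-4        # learning rate listed above
WARMUP_STEPS = 500    # scheduler warmup steps listed above
TOTAL_STEPS = 25_000  # training steps listed above


def linear_schedule_lr(
    step: int,
    peak_lr: float = PEAK_LR,
    warmup_steps: int = WARMUP_STEPS,
    total_steps: int = TOTAL_STEPS,
) -> float:
    """Learning rate at `step` under linear warmup then linear decay (assumed)."""
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to the peak rate
        return peak_lr * step / warmup_steps
    # Decay: fall linearly from the peak rate to 0 at the last step
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


for step in (0, 250, 500, 12_750, 25_000):
    print(step, linear_schedule_lr(step))
```

Under this assumption the rate peaks at step 500 and is back to half the peak by step 12,750.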

Training Results

Training Loss	Epoch	Step	Validation Loss	WER
2.2671	0.13	1000	2.2108	76.2667
1.4465	0.26	2000	1.6057	67.8753
1.0997	0.39	3000	1.1928	54.2433
0.9389	0.52	4000	1.0020	47.8307
0.7881	0.65	5000	0.8933	46.0046
0.7596	0.78	6000	0.7721	38.5595
0.5678	0.91	7000	0.6903	36.2897
0.4412	1.04	8000	0.6476	32.7473
0.4239	1.17	9000	0.5973	30.8142
0.3935	1.3	10000	0.5444	29.0208
0.3307	1.43	11000	0.5024	27.0434
0.2937	1.56	12000	0.4608	24.7318
0.2471	1.69	13000	0.4259	22.8940
0.2357	1.82	14000	0.3936	21.6018
0.2292	1.95	15000	0.3776	20.8004
0.1493	2.08	16000	0.4599	24.0491
0.1708	2.21	17000	0.4370	23.3443
0.1385	2.34	18000	0.4277	22.3171
0.1288	2.47	19000	0.4050	21.0118
0.1627	2.6	20000	0.4507	23.4004
0.1675	2.73	21000	0.4346	22.8261
0.159	2.86	22000	0.4179	22.2949
0.1458	2.99	23000	0.3978	21.0810
0.0487	3.12	24000	0.4456	20.8617
0.0401	3.25	25000	0.4485	20.6842
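
Validation loss and WER do not reach their minimum at the same checkpoint in this table, which matters when choosing which checkpoint to keep. The short sketch below copies the step, validation loss, and WER columns from the table and finds the best row under each criterion. The variable names are illustrative only.

```python
# (step, validation_loss, wer) copied from the training results table
RESULTS = [
    (1000, 2.2108, 76.2667), (2000, 1.6057, 67.8753), (3000, 1.1928, 54.2433),
    (4000, 1.0020, 47.8307), (5000, 0.8933, 46.0046), (6000, 0.7721, 38.5595),
    (7000, 0.6903, 36.2897), (8000, 0.6476, 32.7473), (9000, 0.5973, 30.8142),
    (10000, 0.5444, 29.0208), (11000, 0.5024, 27.0434), (12000, 0.4608, 24.7318),
    (13000, 0.4259, 22.8940), (14000, 0.3936, 21.6018), (15000, 0.3776, 20.8004),
    (16000, 0.4599, 24.0491), (17000, 0.4370, 23.3443), (18000, 0.4277, 22.3171),
    (19000, 0.4050, 21.0118), (20000, 0.4507, 23.4004), (21000, 0.4346, 22.8261),
    (22000, 0.4179, 22.2949), (23000, 0.3978, 21.0810), (24000, 0.4456, 20.8617),
    (25000, 0.4485, 20.6842),
]

# Pick the checkpoint with the lowest value in each column
best_by_loss = min(RESULTS, key=lambda row: row[1])
best_by_wer = min(RESULTS, key=lambda row: row[2])

print("lowest validation loss:", best_by_loss)
print("lowest WER:", best_by_wer)
```

The lowest validation loss (0.3776) occurs at step 15000, while the lowest WER (20.6842) occurs at the final step, 25000, which is the result reported for this model.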