whisper-th-large-v3-combined: an open-source Thai speech recognition model with a low word error rate

Whisper Th Large V3 Combined

Developed by biodatlab

This is a Thai automatic speech recognition model fine-tuned from OpenAI's Whisper Large V3. It achieves a word error rate of 6.59% on the Common Voice 13 Thai test set.

Speech recognition

Transformers

License: Apache-2.0 #Thai speech recognition #Low word error rate #Multi-dataset fine-tuning

Downloads: 1,354

Released: 2/20/2024

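Model overview

This is an automatic speech recognition (ASR) model optimized for Thai. It was fine-tuned on augmented versions of the Common Voice 13 and FLEURS datasets and is intended for Thai speech transcription.

Model features

Low word error rate

A word error rate (WER) of only 6.59% on the Common Voice 13 Thai test set

Optimized for Thai

Fine-tuned specifically on the characteristics of Thai speech

Trained on combined datasets

Training data combines augmented versions of several datasets, including Common Voice 13 and FLEURS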

Model capabilities

Thai speech recognition

Audio transcription

Long-form audio processing (supports 30-second chunking)
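
To make the 30-second chunking concrete, here is a minimal sketch of how a long recording could be split into overlapping windows. The function name `chunk_windows`, the parameter `stride_s`, and the 5-second overlap are assumptions chosen for illustration; this is not the exact logic used inside the transformers pipeline.

```python
def chunk_windows(duration_s, chunk_length_s=30.0, stride_s=5.0):
    """Return (start, end) windows in seconds that cover the whole recording.

    Neighbouring windows overlap by 2 * stride_s so that words cut at a
    boundary appear in full in at least one window. (Illustrative only.)
    """
    step = chunk_length_s - 2 * stride_s  # how far each new window advances
    if step <= 0:
        raise ValueError("stride_s must be less than half of chunk_length_s")
    windows = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_length_s, duration_s)
        windows.append((start, end))
        if end >= duration_s:  # the recording is fully covered
            break
        start += step
    return windows


# A 70-second recording is covered by three overlapping windows
print(chunk_windows(70.0))
```

Each window would be transcribed separately and the overlapping regions reconciled when the partial transcripts are merged.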

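Use cases

Speech transcription

Thai meeting minutes

Automatically transcribe recordings of Thai-language meetings into text

High-accuracy transcripts

Subtitle generation for Thai media

Automatically generate subtitles for Thai video content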

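🚀 Whisper Large V3 (Thai): Combined V1

This model is a fine-tuned version of openai/whisper-large-v3, trained on augmented versions of the mozilla-foundation/common_voice_13_0 Thai dataset, the google/fleurs dataset, and curated datasets. It achieves the following result on the common-voice-13 test set:

Word error rate (WER): 6.59 (using the Deepcut tokenizer)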
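
Because written Thai does not put spaces between words, the WER above depends on the word tokenizer used to segment both the reference and the transcript. The following is a minimal sketch of the metric itself, computed over lists of tokens; it is not the authors' evaluation script. The function name `word_error_rate` and the sample tokens are illustrative, and in practice the token lists would come from a Thai word tokenizer such as Deepcut.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    if not reference:
        raise ValueError("reference must contain at least one token")
    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis tokens
    prev = list(range(len(hypothesis) + 1))
    for i, ref_tok in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hypothesis, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(reference)


# Hand-segmented tokens for illustration only
ref = ["ฉัน", "ชอบ", "กิน", "ข้าว"]
hyp = ["ฉัน", "ชอบ", "ข้าว"]
print(word_error_rate(ref, hyp))  # one deleted word out of four: 0.25
```

A reported WER of 6.59 corresponds to a value of about 0.0659 on this scale, summed over the whole test set rather than averaged per utterance in most evaluation setups.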

🚀 Quick start

The following example calls the model through the Hugging Face transformers library:

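import torch
from transformers import pipeline

MODEL_NAME = "biodatlab/whisper-th-large-v3-combined"  # model to load from the Hugging Face Hub
lang = "th"  # transcribe in Thai

# Use the first GPU if one is available, otherwise fall back to the CPU
device = 0 if torch.cuda.is_available() else "cpu"

pipe = pipeline(
    task="automatic-speech-recognition",
    model=MODEL_NAME,
    chunk_length_s=30,  # process long audio in 30-second chunks
    device=device,
)
# Force the decoder to transcribe in Thai instead of auto-detecting the language
pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(
    language=lang,
    task="transcribe",
)
text = pipe("audio.mp3")["text"]  # pass in an audio file and transcribe it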
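
For the subtitle use case, the pipeline can also be called with `return_timestamps=True`, in which case the result includes a `chunks` list of timestamped segments. The sketch below turns such a list into SRT text. The helper names `format_srt_time` and `chunks_to_srt` and the sample segments are hypothetical, and the handling of a missing end time is an assumption about how the final segment may be reported.

```python
def format_srt_time(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def chunks_to_srt(chunks):
    """Convert [{'timestamp': (start, end), 'text': ...}, ...] to SRT text."""
    entries = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        if end is None:  # assumption: a final segment may lack an end time
            end = start
        entries.append(
            f"{index}\n"
            f"{format_srt_time(start)} --> {format_srt_time(end)}\n"
            f"{chunk['text'].strip()}\n"
        )
    return "\n".join(entries)  # a blank line separates SRT entries


# Sample data in the assumed shape of
# pipe("audio.mp3", return_timestamps=True)["chunks"]
example_chunks = [
    {"timestamp": (0.0, 2.5), "text": " สวัสดีครับ"},
    {"timestamp": (2.5, 5.04), "text": " ยินดีต้อนรับ"},
]
print(chunks_to_srt(example_chunks))
```

The returned string can be written to a `.srt` file with UTF-8 encoding so that the Thai text is preserved.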

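📚 Detailed documentation

Model description

This model is a fine-tuned version of openai/whisper-large-v3, trained on augmented versions of the mozilla-foundation/common_voice_13_0 Thai dataset, the google/fleurs dataset, and curated datasets. On the common-voice-13 test set, its word error rate (WER) is 6.59 (using the Deepcut tokenizer).

Intended uses and limitations

More information needed.

Training and evaluation data

More information needed.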

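Training procedure

Training hyperparameters

The following hyperparameters were used during training:

Hyperparameter	Value
Learning rate	1e-05
Training batch size	16
Evaluation batch size	16
Random seed	42
Optimizer	AdamW (β1 = 0.9, β2 = 0.999, ε = 1e-08)
Learning rate scheduler type	Linear
Learning rate scheduler warmup steps	500
Training steps	10000
Mixed-precision training	Native AMP (automatic mixed precision)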
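
To illustrate what the scheduler settings above imply, here is a minimal sketch of a linear schedule with warmup: the learning rate rises linearly from 0 to 1e-05 over the first 500 steps, then falls linearly until step 10000. The function name `linear_schedule_lr` is hypothetical, and the assumption that the rate decays all the way to zero is not stated in the table.

```python
def linear_schedule_lr(step, base_lr=1e-5, warmup_steps=500, total_steps=10_000):
    """Learning rate at a given optimizer step under linear warmup and decay."""
    if step < warmup_steps:
        # Warmup phase: scale up linearly from 0 to base_lr
        return base_lr * step / warmup_steps
    # Decay phase: scale down linearly, assumed to reach 0 at total_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


for step in (0, 250, 500, 5250, 10_000):
    print(step, linear_schedule_lr(step))
```

Under these assumptions the rate is at half its peak both midway through warmup (step 250) and midway through decay (step 5250).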

Framework versions

Transformers 4.37.2
PyTorch 2.1.0
Datasets 2.16.1
Tokenizers 0.15.1

Citation

Cite this model using BibTeX:

@misc {thonburian_whisper_med,
    author       = { Atirut Boribalburephan and Zaw Htet Aung and Knot Pipatsrisawat and Titipat Achakulvisut },
    title        = { Thonburian Whisper: A fine-tuned Whisper model for Thai automatic speech recognition },
    year         = 2022,
    url          = { https://huggingface.co/biodatlab/whisper-th-medium-combined },
    doi          = { 10.57967/hf/0226 },
    publisher    = { Hugging Face }
}