whisper-small-cantonese: an open-source Cantonese speech recognition model

Whisper Small Cantonese

Developed by alvanlii

A Cantonese speech recognition model fine-tuned from OpenAI Whisper-small, with a CER of 7.93% (punctuation removed) on the Common Voice 16.0 test set

Speech recognition

Transformers

Supports multiple languages | License: Apache-2.0 | #Cantonese-speech-recognition #low-CER #fast-inference

Downloads: 2,413

Released: 12/8/2022

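Model overview

An automatic speech recognition model optimized for Cantonese, providing fast and accurate Cantonese speech-to-text

Model features

Optimized Cantonese recognition

Fine-tuned specifically for Cantonese, reaching a character error rate (CER) of 7.93% with punctuation removed

Efficient inference

Supports Flash Attention acceleration, at 0.055 seconds per sample on an RTX 3090

Multiple formats

Available in GGML and CT2 formats, compatible with tools such as Whisper.cpp and WhisperX

Speculative decoding support

Can serve as the assistant model that speeds up inference for a larger model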

Model capabilities

Cantonese speech recognition

Chinese speech recognition

Fast speech-to-text

Long-audio processing (with chunking)

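Use cases

Speech transcription

Subtitle generation for Cantonese video

Automatically generate subtitles for Cantonese video content

CER of 7.93% (punctuation removed)

Voice assistants

Build voice interaction applications that support Cantonese

Fast response (0.055 seconds per sample)

Speech analysis

Cantonese speech data analysis

Transcribe and analyze Cantonese speech content

Supports multiple Cantonese dataset formats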

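🚀 Whisper Small Cantonese - Alvin

This model is a Cantonese fine-tune of openai/whisper-small. On the Common Voice 16.0 dataset it reaches a character error rate (CER) of 7.93% without punctuation and 9.72% with punctuation.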

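✨ Key features

Fine-tuned for Cantonese from the pretrained openai/whisper-small model.
Trained on several Cantonese datasets and evaluated on the Common Voice 16.0 yue test set, with strong Cantonese recognition performance.
Supports inference acceleration methods such as Flash Attention and speculative decoding.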

📦 Installation

The source model card does not give installation steps, so none are listed here.

💻 Usage examples

Basic usage

import librosa
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Whisper expects 16 kHz audio; librosa resamples to that rate on load.
y, sr = librosa.load('audio.mp3', sr=16000)

MODEL_NAME = "alvanlii/whisper-small-cantonese"

processor = WhisperProcessor.from_pretrained(MODEL_NAME)
model = WhisperForConditionalGeneration.from_pretrained(MODEL_NAME)

# Convert the waveform into log-mel input features.
processed_in = processor(y, sampling_rate=sr, return_tensors="pt")
# Generate token IDs; return_dict_in_generate exposes .sequences and the scores.
gout = model.generate(
    input_features=processed_in.input_features,
    output_scores=True, return_dict_in_generate=True
)
# Decode the token IDs back into text.
transcription = processor.batch_decode(gout.sequences, skip_special_tokens=True)[0]
print(transcription)

Advanced usage

Inference with Hugging Face pipelines:

from transformers import pipeline
MODEL_NAME = "alvanlii/whisper-small-cantonese" 
lang = "zh"
device = 0  # assumes inference runs on a GPU
pipe = pipeline(
    task="automatic-speech-recognition",
    model=MODEL_NAME,
    chunk_length_s=30,
    device=device,
)
pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(language=lang, task="transcribe")
text = pipe('audio.mp3')["text"]

📚 Detailed documentation

Training and evaluation data

Training data

CantoMap: Winterstein, Grégoire, Tang, Carmen, and Lai, Regine (2020). "CantoMap: a Hong Kong Cantonese MapTask Corpus", in Proceedings of The 12th Language Resources and Evaluation Conference, Marseille: European Language Resources Association, pp. 2899-2906.
Cantonese-ASR: Yu, Tiezheng, Frieske, Rita, Xu, Peng, Cahyawijaya, Samuel, Yiu, Cheuk Tung, Lovenia, Holy, Dai, Wenliang, Barezi, Elham, Chen, Qifeng, Ma, Xiaojuan, Shi, Bertram, and Fung, Pascale (2022). "Automatic Speech Recognition Datasets in Cantonese: A Survey and New Dataset". Link: https://arxiv.org/pdf/2201.02419.pdf

Name	Duration (hours)
Common Voice 16.0 zh-HK Train	138
Common Voice 16.0 yue Train	85
Common Voice 17.0 yue Train	178
Cantonese-ASR	72
CantoMap	23
Pseudo-Labelled YouTube Data	438
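As a quick arithmetic check on the table, the snippet below adds up the listed hours and works out how much of the total is pseudo-labelled. It is a naive sum: the source does not say whether the splits overlap (for example the two Common Voice yue releases), so the total should be read as an upper bound rather than as unique audio.

```python
# Hours per training source, copied from the table above.
training_hours = {
    "Common Voice 16.0 zh-HK Train": 138,
    "Common Voice 16.0 yue Train": 85,
    "Common Voice 17.0 yue Train": 178,
    "Cantonese-ASR": 72,
    "CantoMap": 23,
    "Pseudo-Labelled YouTube Data": 438,
}

# Naive total; any overlap between splits is not accounted for.
total_hours = sum(training_hours.values())
pseudo_share = training_hours["Pseudo-Labelled YouTube Data"] / total_hours

print(f"Total listed: {total_hours} h")                   # Total listed: 934 h
print(f"Pseudo-labelled share: {pseudo_share:.1%}")       # Pseudo-labelled share: 46.9%
```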

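Evaluation data

Evaluated on the Common Voice 16.0 yue test set.

Evaluation results

Character error rate (CER, lower is better):
- Without punctuation: 0.0793
- With punctuation: 0.0972, down from 0.1073 and 0.1581 in earlier versions
Inference speed and memory (all GPU measurements were taken on an RTX 3090):
- GPU inference with Flash Attention (see the Model acceleration section): 0.055 seconds per sample
- GPU inference without Flash Attention: 0.308 seconds per sample
- CPU inference: 2.57 seconds per sample
- GPU VRAM usage: about 1.5 GB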
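To make the two CER figures concrete, here is a minimal sketch of how a character error rate can be computed: the Levenshtein distance between reference and hypothesis characters, divided by the reference length. The source does not describe its exact text normalization, so the punctuation stripping here (Unicode punctuation categories plus whitespace) is an assumption, and the function names are hypothetical.

```python
import unicodedata


def strip_punctuation(text: str) -> str:
    """Drop Unicode punctuation and whitespace (an assumed normalization)."""
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("P") and not ch.isspace()
    )


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance using a rolling row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1]


def cer(reference: str, hypothesis: str, keep_punctuation: bool = True) -> float:
    """Character error rate = edit distance / number of reference characters."""
    if not keep_punctuation:
        reference = strip_punctuation(reference)
        hypothesis = strip_punctuation(hypothesis)
    if not reference:
        raise ValueError("reference must contain at least one character")
    return edit_distance(reference, hypothesis) / len(reference)


ref, hyp = "今日天氣好好。", "今日天汽好好"
print(cer(ref, hyp))                          # one substitution + one deletion over 7 chars
print(cer(ref, hyp, keep_punctuation=False))  # one substitution over 6 chars
```

A missing full stop counts as an error only when punctuation is kept, which is why the with-punctuation figure is the higher of the two.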

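Model acceleration

Adding attn_implementation="sdpa" when loading the model switches attention to PyTorch's scaled dot-product attention, which can dispatch to Flash Attention kernels on supported GPUs.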

from transformers import AutoModelForSpeechSeq2Seq
import torch

torch_dtype = torch.float16
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "alvanlii/whisper-small-cantonese",
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    attn_implementation="sdpa",
)

With Flash Attention, per-sample inference time drops from 0.308 seconds to 0.055 seconds.
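The per-sample figures can be reproduced in spirit with a small timing harness. This is a sketch under assumptions, not the benchmark the author ran: seconds_per_sample and speedup are hypothetical helpers, and transcribe stands for any callable that runs the model on one sample.

```python
import time


def seconds_per_sample(transcribe, samples, warmup=1):
    """Average wall-clock seconds per sample for a transcribe(sample) callable.

    For GPU models the callable should block until the work is finished
    (for example by calling torch.cuda.synchronize()), otherwise the
    asynchronous launch makes the timing look too good.
    """
    for sample in samples[:warmup]:  # warm-up runs are excluded from timing
        transcribe(sample)
    start = time.perf_counter()
    for sample in samples:
        transcribe(sample)
    return (time.perf_counter() - start) / len(samples)


def speedup(baseline_s, accelerated_s):
    """How many times faster the accelerated run is than the baseline."""
    return baseline_s / accelerated_s


# Figures reported above (RTX 3090):
print(f"{speedup(0.308, 0.055):.1f}x")  # 5.6x, GPU without vs. with Flash Attention
print(f"{speedup(2.57, 0.055):.1f}x")   # 46.7x, CPU vs. GPU with Flash Attention
```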

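Speculative decoding

A larger model can serve as the main model, with alvanlii/whisper-small-cantonese as the assistant model that speeds up inference with almost no loss in accuracy.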

from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
import torch

torch_dtype = torch.float16
device = "cuda" if torch.cuda.is_available() else "cpu"

model_id = "simonl0909/whisper-large-v2-cantonese"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    attn_implementation="sdpa",
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

assistant_model_id = "alvanlii/whisper-small-cantonese"

assistant_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    assistant_model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    attn_implementation="sdpa",
)

assistant_model.to(device)
# inputs is assumed to be the preprocessed model input
inputs = processor(...)
model.generate(**inputs, use_cache=True, assistant_model=assistant_model)

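The original simonl0909/whisper-large-v2-cantonese model takes 0.714 seconds per sample with a CER of 7.65%. With alvanlii/whisper-small-cantonese as the assistant for speculative decoding, it takes 0.137 seconds per sample with a CER of 7.67%, roughly 5.2 times faster at nearly the same accuracy.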
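The accuracy is nearly unchanged because the large model still has the final say on every token. The toy sketch below shows the greedy form of that idea with plain Python functions standing in for the two models. It is not the Transformers implementation, and all names in it are hypothetical. The draft proposes a few tokens, the target checks them, and the first disagreement is replaced by the target's own choice.

```python
def greedy_decode(next_token, prompt, max_new_tokens, eos=None):
    """Plain greedy decoding with a single model."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(tokens)
        tokens.append(token)
        if token == eos:
            break
    return tokens


def speculative_decode(target_next, draft_next, prompt, max_new_tokens,
                       eos=None, lookahead=4):
    """Greedy speculative decoding: the draft proposes, the target verifies."""
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    while len(tokens) < limit:
        # 1. The cheap draft model proposes up to `lookahead` tokens.
        draft = []
        for _ in range(min(lookahead, limit - len(tokens))):
            draft.append(draft_next(tokens + draft))
        # 2. The target checks each proposal. A real implementation scores all
        #    proposals in one forward pass, which is where the speedup comes
        #    from; this toy loop calls the target once per token instead.
        for proposed in draft:
            expected = target_next(tokens)
            tokens.append(expected)       # always keep the target's token
            if expected == eos:
                return tokens
            if expected != proposed:      # first mismatch: discard the rest
                break
    return tokens


def toy_target(tokens):
    """Stand-in for the large model: count upwards modulo 10."""
    return (tokens[-1] + 1) % 10


def toy_draft(tokens):
    """Stand-in for the small model: same rule, but wrong after a 4."""
    return 0 if tokens[-1] == 4 else (tokens[-1] + 1) % 10


print(speculative_decode(toy_target, toy_draft, [0], 20, eos=9))
print(greedy_decode(toy_target, [0], 20, eos=9))
```

Both calls print the same sequence: in this greedy setting the output is by construction identical to what the target model alone would produce, and the draft only affects how much work the target can verify in one step.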

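Whisper.cpp

As of June 2024, GGML binaries for Whisper.cpp have been uploaded. The original page links to the binary download and to a demo for testing them.

Whisper CT2

CT2 files are required to use the model with WhisperX or FasterWhisper. The converted model files are linked from the original page.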

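Training hyperparameters

Property	Details
Learning rate	5e-5
Training batch size	25 (on one 3090 GPU)
Evaluation batch size	8
Gradient accumulation steps	4
Total training batch size	25 x 4 = 100
Optimizer	Adam, betas=(0.9, 0.999), epsilon=1e-08
Learning rate scheduler type	Linear
Learning rate scheduler warmup steps	500
Training steps	15000
Data augmentation	None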
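The learning-rate rows can be read together as a schedule. The sketch below assumes that "linear" means the common pattern of linear warmup from zero followed by linear decay to zero at the last step. The source states only the scheduler type, so the exact shape is an assumption, and linear_lr is a hypothetical helper.

```python
PEAK_LR = 5e-5          # learning rate from the table
WARMUP_STEPS = 500      # scheduler warmup steps
TOTAL_STEPS = 15_000    # training steps
PER_DEVICE_BATCH = 25   # training batch size on one GPU
GRAD_ACCUM_STEPS = 4    # gradient accumulation steps


def linear_lr(step: int) -> float:
    """Learning rate at a given optimizer step (assumed warmup + linear decay)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(0, TOTAL_STEPS - step)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


# Samples seen per optimizer step, matching the "25 x 4 = 100" row.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS

for step in (0, 250, 500, 7_750, 15_000):
    print(step, linear_lr(step))
```

Under these assumptions the rate peaks at 5e-5 at step 500 and is back at half the peak by step 7,750, the midpoint of the decay phase.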