distil-whisper-small-cantonese: an open-source Cantonese speech recognition model for free, accurate Cantonese speech-to-text

Distil Whisper Small Cantonese

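Developed by alvanlii

A distilled Cantonese speech recognition model based on Whisper Small. It reaches a character error rate (CER) of 9.7% on Common Voice 16.0, measured without punctuation.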

Speech Recognition

Transformers

Chinese · License: Apache-2.0 · #Cantonese speech recognition #Lightweight model #Low-resource inference

Downloads: 187

Released: 4/3/2024

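Model Overview

This model is a distilled version of Whisper Small, optimized specifically for Cantonese speech recognition. It is smaller than the original and runs inference faster.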

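Model Features

Efficient inference

Inference time per sample is roughly half that of the Whisper Small model it was distilled from (0.027 s vs 0.055 s on GPU with SDPA), and it needs only about 2 GB of GPU VRAM.

Optimized for Cantonese

Trained and optimized specifically for Cantonese speech recognition.

Lightweight

The model is compressed by reducing the number of decoder layers, which cuts the parameter count from 242M to 157M.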

Model Capabilities

Cantonese speech recognition

Speech-to-text

Audio transcription

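Use Cases

Speech transcription

Cantonese meeting notes

Automatically transcribes recordings of Cantonese meetings into text.

Reaches a character error rate (CER) of 9.7% on the Common Voice 16.0 test set.

Media subtitle generation

Automatically generates subtitles for Cantonese video content.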
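
For the subtitle use case, one possible approach is to convert timestamped transcription chunks into SRT text. The sketch below assumes chunks shaped like the `chunks` list that the Transformers ASR pipeline returns when called with `return_timestamps=True` (a list of dicts with `timestamp` and `text`), and that every chunk has both a start and an end time. The helper names `format_timestamp` and `chunks_to_srt` are hypothetical, and the example chunks are made up rather than real model output.

```python
def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def chunks_to_srt(chunks) -> str:
    """Turn [{"timestamp": (start, end), "text": str}, ...] into SRT text."""
    entries = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        entries.append(
            f"{index}\n"
            f"{format_timestamp(start)} --> {format_timestamp(end)}\n"
            f"{chunk['text'].strip()}\n"
        )
    # A blank line separates consecutive SRT entries.
    return "\n".join(entries)


# Hypothetical chunks, made up for illustration (not real model output).
example_chunks = [
    {"timestamp": (0.0, 2.5), "text": "你好"},
    {"timestamp": (2.5, 5.04), "text": "多謝"},
]
srt_text = chunks_to_srt(example_chunks)
print(srt_text)
```

Whether this distilled model produces reliable timestamps is not stated in the model card, so the timing quality should be checked on real recordings before relying on it.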

🚀 Distil-Whisper Small zh-HK - Alvin

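This is a distilled model for Cantonese speech recognition. It is distilled from alvanlii/whisper-small-cantonese, which reduces model complexity while keeping accuracy close to the original (CER 0.097 vs 0.089 on Common Voice 16.0).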

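🚀 Quick Start

This model is a distilled version of alvanlii/whisper-small-cantonese for Cantonese.

On Common Voice 16.0, the character error rate (CER) is 9.7% without punctuation and 11.59% with punctuation.
It has only 3 decoder layers, compared with 12 in the regular Whisper small model.
It requires only about 2 GB of GPU VRAM.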
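
The drop from 12 to 3 decoder layers can be checked against the reported parameter counts (242M and 157M) with some back-of-the-envelope arithmetic. The sketch below assumes the standard Whisper small dimensions (hidden size 768, feed-forward size 3072) and the usual layer layout of self-attention, cross-attention, a feed-forward block, and three LayerNorms, with no bias on the key projection. None of these details are stated in this model card, so treat them as assumptions.

```python
# Assumed Whisper small dimensions (not stated in this model card).
D_MODEL = 768
FFN_DIM = 3072


def linear_params(n_in: int, n_out: int, bias: bool = True) -> int:
    """Parameters of a dense layer: weight matrix plus optional bias."""
    return n_in * n_out + (n_out if bias else 0)


def attention_params(d: int) -> int:
    # Query, value and output projections with bias; key projection without.
    return 3 * linear_params(d, d) + linear_params(d, d, bias=False)


def decoder_layer_params(d: int, ffn: int) -> int:
    attention = 2 * attention_params(d)  # self-attention + cross-attention
    feed_forward = linear_params(d, ffn) + linear_params(ffn, d)
    layer_norms = 3 * 2 * d              # three LayerNorms, weight + bias each
    return attention + feed_forward + layer_norms


per_layer = decoder_layer_params(D_MODEL, FFN_DIM)
removed = (12 - 3) * per_layer  # nine decoder layers dropped

print(f"parameters per decoder layer: {per_layer / 1e6:.2f}M")
print(f"removed by dropping 9 layers: {removed / 1e6:.1f}M")
print(f"estimated distilled size:     {242 - removed / 1e6:.0f}M")
```

Under these assumptions each decoder layer holds about 9.45M parameters, so removing nine of them accounts for roughly 85M, which is consistent with the reported reduction from 242M to 157M.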

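✨ Key Features

Streamlined architecture: distillation reduces the number of decoder layers and lowers model complexity.
Low VRAM requirement: needs only about 2 GB of GPU VRAM, so it can run in resource-constrained environments.
High accuracy: achieves a low character error rate on Cantonese speech recognition.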
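
As a rough sanity check on the VRAM figure, the memory taken by the weights alone can be estimated from the parameter count. This is only a sketch: the model card does not break down the 2 GB figure, and the assumption here is that the remainder goes to activations, the decoding cache, and framework overhead. The helper name `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


PARAMS = 157e6  # parameter count reported in the model card

fp32_gib = weight_memory_gib(PARAMS, 4)  # 4 bytes per float32 weight
fp16_gib = weight_memory_gib(PARAMS, 2)  # 2 bytes per float16 weight

print(f"float32 weights: {fp32_gib:.2f} GiB")
print(f"float16 weights: {fp16_gib:.2f} GiB")
```

The weights themselves take well under 1 GiB in either precision, so the reported 2 GB leaves headroom for everything else needed at inference time.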

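📦 Installation

The original model card lists no installation steps. The usage examples below import the `librosa`, `torch`, and `transformers` Python packages, so these need to be installed first.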

💻 Usage Examples

Basic usage

import librosa
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Load the audio file and resample it to 16 kHz.
y, sr = librosa.load('audio.mp3', sr=16000)

MODEL_NAME = "alvanlii/distil-whisper-small-cantonese"

processor = WhisperProcessor.from_pretrained(MODEL_NAME)
model = WhisperForConditionalGeneration.from_pretrained(MODEL_NAME)

# Clear forced decoder prompts and token suppression, and disable the cache.
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []
model.config.use_cache = False

# Convert the waveform into model input features.
processed_in = processor(y, sampling_rate=sr, return_tensors="pt")
gout = model.generate(
    input_features=processed_in.input_features,
    output_scores=True, return_dict_in_generate=True
)
# Decode the generated token IDs into text.
transcription = processor.batch_decode(gout.sequences, skip_special_tokens=True)[0]
print(transcription)

Advanced usage

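import torch
from transformers import pipeline

MODEL_NAME = "alvanlii/distil-whisper-small-cantonese"
lang = "zh"
# Not defined in the original snippet; assumed here: use the GPU if available.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
# Placeholder input path (assumption); replace it with your own recording.
file = "audio.mp3"

pipe = pipeline(
    task="automatic-speech-recognition",
    model=MODEL_NAME,
    chunk_length_s=30,  # process long audio in 30-second chunks
    device=device,
)
# Force the decoder prompt to Chinese transcription, as in the model card.
pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(language=lang, task="transcribe")
text = pipe(file)["text"]
print(text)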
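
The `chunk_length_s=30` setting above makes the pipeline process long recordings in 30-second windows. The sketch below illustrates the general idea of splitting a recording into overlapping fixed-length windows. It is a simplified illustration, not the pipeline's actual implementation: the function name `chunk_bounds` and the 5-second overlap are assumptions chosen for the example.

```python
def chunk_bounds(num_samples, sample_rate=16000, chunk_s=30.0, overlap_s=5.0):
    """Return (start, end) sample indices of overlapping fixed-length windows."""
    chunk = int(chunk_s * sample_rate)
    step = chunk - int(overlap_s * sample_rate)
    if step <= 0:
        raise ValueError("overlap_s must be smaller than chunk_s")
    bounds = []
    start = 0
    while True:
        end = min(start + chunk, num_samples)
        bounds.append((start, end))
        if end >= num_samples:  # the last window reaches the end of the audio
            break
        start += step
    return bounds


# A 70-second recording sampled at 16 kHz.
windows = chunk_bounds(70 * 16000)
for start, end in windows:
    print(f"{start / 16000:.0f}s to {end / 16000:.0f}s")
```

The overlap gives each window some shared context with its neighbours, so that words cut at a boundary can be reconciled when the per-window transcripts are merged.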

📚 Detailed Documentation

Training and evaluation data

Training data

CantoMap: Winterstein, Grégoire, Tang, Carmen, and Lai, Regine (2020). "CantoMap: a Hong Kong Cantonese MapTask Corpus". In Proceedings of the 12th Language Resources and Evaluation Conference, Marseille: European Language Resources Association, pp. 2899-2906.
Cantonese-ASR: Yu, Tiezheng, Frieske, Rita, Xu, Peng, Cahyawijaya, Samuel, Yiu, Cheuk Tung, Lovenia, Holy, Dai, Wenliang, Barezi, Elham, Chen, Qifeng, Ma, Xiaojuan, Shi, Bertram, and Fung, Pascale (2022). "Automatic Speech Recognition Datasets in Cantonese: A Survey and New Dataset". Link: https://arxiv.org/pdf/2201.02419.pdf
Common Voice Cantonese and zh-HK training sets

Evaluation data

The model is evaluated on the Common Voice 16.0 Cantonese test set.
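
The model card reports CER both with and without punctuation but does not say which tool or text normalization was used. As a minimal sketch of the metric itself, CER can be computed as the character-level edit distance divided by the reference length. The punctuation stripping below (dropping Unicode punctuation and whitespace) is an assumption, and the sentence pair is made up for illustration.

```python
import unicodedata


def strip_punctuation(text: str) -> str:
    """Drop Unicode punctuation and whitespace characters."""
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("P") and not ch.isspace()
    )


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (or match)
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str, ignore_punctuation: bool = False) -> float:
    """Character error rate: edit distance divided by reference length."""
    if ignore_punctuation:
        ref, hyp = strip_punctuation(ref), strip_punctuation(hyp)
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Made-up example pair (not taken from Common Voice).
reference = "我哋今日去飲茶。"
hypothesis = "我地今日去飲茶"
print(f"CER with punctuation:    {cer(reference, hypothesis):.3f}")
print(f"CER without punctuation: {cer(reference, hypothesis, True):.3f}")
```

In this example the missing full stop counts as an error only when punctuation is kept, which shows why a with-punctuation CER is typically the higher of the two figures.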

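Comparison with Whisper Small

| Metric | `alvanlii/distil-whisper-small-cantonese` | `alvanlii/whisper-small-cantonese` |
| --- | --- | --- |
| Character error rate (CER, lower is better) | 0.097 | 0.089 |
| GPU inference time (sdpa) [s/sample] | 0.027 | 0.055 |
| GPU inference time (regular) [s/sample] | 0.027 | 0.308 |
| CPU inference time [s/sample] | 1.3 | 2.57 |
| Parameters [M] | 157 | 242 |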
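
To make the trade-off in the table easier to read, the relative differences can be worked out directly from the reported numbers. The snippet below only does arithmetic on the table values and introduces no new measurements.

```python
# Values copied from the comparison table: (distilled, original).
rows = {
    "GPU inference time, sdpa [s/sample]": (0.027, 0.055),
    "GPU inference time, regular [s/sample]": (0.027, 0.308),
    "CPU inference time [s/sample]": (1.3, 2.57),
    "Parameters [M]": (157, 242),
}

# How many times larger the original model's value is than the distilled one.
ratios = {
    name: original / distilled
    for name, (distilled, original) in rows.items()
}
for name, ratio in ratios.items():
    print(f"{name}: original / distilled = {ratio:.2f}")

# Absolute CER cost of distillation (distilled minus original).
cer_gap = 0.097 - 0.089
print(f"CER gap: {cer_gap:.3f}")
```

By these figures the distilled model is about twice as fast on GPU with SDPA and on CPU, and has roughly 35% fewer parameters, at a cost of about 0.008 in absolute CER.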