distil-whisper-small-cantonese: an open-source Cantonese speech recognition model for free, accurate Cantonese speech-to-text

Distil Whisper Small Cantonese

Developed by alvanlii

A distilled Cantonese speech recognition model based on Whisper Small. It reaches a character error rate (CER) of 9.7% without punctuation on Common Voice 16.0.

Speech recognition

Transformers

Chinese

License: Apache-2.0

#CantoneseSpeechRecognition #LightweightModel #LowResourceInference

Downloads: 187

Released: 4/3/2024

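Model overview

This model is a distilled version of Whisper Small, optimized specifically for Cantonese speech recognition. It is smaller than the model it was distilled from and runs inference faster.

Model features

Efficient inference

GPU inference with SDPA attention takes about half as long as with alvanlii/whisper-small-cantonese (0.027 vs. 0.055 s/sample), and the model needs only about 2 GB of GPU VRAM.

Cantonese optimization

Trained and optimized specifically for Cantonese speech recognition.

Lightweight

The model is compressed by reducing the number of decoder layers, which cuts the parameter count from 242M to 157M.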

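Model capabilities

Cantonese speech recognition

Speech-to-text

Audio transcription

Use cases

Speech transcription

Cantonese meeting notes

Automatically transcribe recordings of Cantonese meetings into text.

The model reaches a 9.7% character error rate (CER) on the Common Voice 16.0 test set.

Media subtitle generation

Automatically generate subtitles for Cantonese video content.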
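
For the subtitle use case, the transcription has to be turned into timed cues. The sketch below shows one way to format timestamped segments as SRT text. It assumes the segments arrive as a list of dicts with a `timestamp` pair (start, end, in seconds) and a `text` field, which is the shape the Transformers speech recognition pipeline returns under `chunks` when called with `return_timestamps=True`. The helper names `srt_timestamp` and `chunks_to_srt`, and the two-second fallback for a missing end time, are illustrative choices and not part of the model card.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def chunks_to_srt(chunks, fallback_duration: float = 2.0) -> str:
    """Convert timestamped chunks into SRT subtitle text.

    Assumed input shape: [{"timestamp": (start, end), "text": "..."}, ...].
    If a chunk has no end time, the cue is given `fallback_duration` seconds
    (an arbitrary choice made for this sketch).
    """
    blocks = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        if end is None:
            end = start + fallback_duration
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(start)} --> {srt_timestamp(end)}\n"
            f"{chunk['text'].strip()}\n"
        )
    # A blank line separates consecutive cues.
    return "\n".join(blocks)


if __name__ == "__main__":
    example = [
        {"timestamp": (0.0, 2.5), "text": " 大家好"},
        {"timestamp": (2.5, None), "text": "今日開會"},
    ]
    print(chunks_to_srt(example))
```

The resulting string can be written to a `.srt` file next to the video. Real recordings may need cues merged or split for readability, which this sketch does not attempt.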

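🚀 Distil-Whisper Small zh-HK - Alvin

This is a compact Cantonese speech recognition model. It is distilled from alvanlii/whisper-small-cantonese, which reduces model complexity while keeping recognition accuracy close to that of the original.

🚀 Quick start

This model is a distilled version of alvanlii/whisper-small-cantonese, the Cantonese fine-tune of Whisper Small.

On Common Voice 16.0, the character error rate (CER) is 9.7% without punctuation and 11.59% with punctuation.
The model has only 3 decoder layers, compared with 12 in the regular Whisper Small model.
It needs only about 2 GB of GPU VRAM.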
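
To make the two CER figures above concrete, here is a minimal sketch of how a character error rate can be computed: the Levenshtein edit distance between reference and hypothesis, divided by the number of reference characters. The model card does not state the exact text normalization used for evaluation, so the punctuation stripping here (dropping every Unicode punctuation character and all whitespace) is an assumption, and the function names are illustrative.

```python
import unicodedata


def strip_punctuation(text: str) -> str:
    """Remove Unicode punctuation and whitespace (an assumed normalization)."""
    return "".join(
        ch
        for ch in text
        if not unicodedata.category(ch).startswith("P") and not ch.isspace()
    )


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion
                    curr[j - 1] + 1,     # insertion
                    prev[j - 1] + cost,  # substitution or match
                )
            )
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str, ignore_punctuation: bool = True) -> float:
    """Character error rate: edit distance divided by reference length."""
    if ignore_punctuation:
        ref, hyp = strip_punctuation(ref), strip_punctuation(hyp)
    if not ref:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One substituted character out of five.
    print(cer("我哋去食飯", "我地去食飯"))
    # Missing punctuation only counts when punctuation is kept.
    print(cer("你好，世界。", "你好世界"))
    print(cer("你好，世界。", "你好世界", ignore_punctuation=False))
```

Scoring with punctuation penalizes every missing or wrong punctuation mark, which is consistent with the with-punctuation figure being higher than the without-punctuation one.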

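✨ Key features

Compact architecture: distillation reduces the number of decoder layers and lowers model complexity.
Low VRAM requirement: only about 2 GB of GPU VRAM is needed, so the model can run in resource-constrained environments.
High accuracy: 9.7% CER without punctuation on the Common Voice 16.0 Cantonese test set, compared with 8.9% for the undistilled alvanlii/whisper-small-cantonese.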

💻 Usage examples

Basic usage

import librosa
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Load the audio and resample it to 16 kHz, the rate Whisper expects.
y, sr = librosa.load("audio.mp3", sr=16000)

MODEL_NAME = "alvanlii/distil-whisper-small-cantonese"

processor = WhisperProcessor.from_pretrained(MODEL_NAME)
model = WhisperForConditionalGeneration.from_pretrained(MODEL_NAME)

# Clear forced decoder prompts and token suppression, and disable the KV cache.
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []
model.config.use_cache = False

# Convert the waveform to log-mel input features.
processed_in = processor(y, sampling_rate=sr, return_tensors="pt")
gout = model.generate(
    input_features=processed_in.input_features,
    output_scores=True,
    return_dict_in_generate=True,
)
# Decode the generated token IDs back to text.
transcription = processor.batch_decode(gout.sequences, skip_special_tokens=True)[0]
print(transcription)

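Advanced usage

import torch
from transformers import pipeline

MODEL_NAME = "alvanlii/distil-whisper-small-cantonese"
lang = "zh"
# Assumption: use the first GPU if one is available, otherwise the CPU.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
# Placeholder path; replace with the audio file to transcribe.
file = "audio.mp3"

pipe = pipeline(
    task="automatic-speech-recognition",
    model=MODEL_NAME,
    chunk_length_s=30,  # process long audio in 30-second chunks
    device=device,
)
# Force the decoder to transcribe (not translate) in the chosen language.
pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(language=lang, task="transcribe")
text = pipe(file)["text"]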

📚 Detailed documentation

Training and evaluation data

Training data

CantoMap: Winterstein, Grégoire, Tang, Carmen and Lai, Regine (2020) "CantoMap: a Hong Kong Cantonese MapTask Corpus", in Proceedings of the 12th Language Resources and Evaluation Conference, Marseille: European Language Resources Association, pp. 2899-2906.
Cantonese-ASR: Yu, Tiezheng, Frieske, Rita, Xu, Peng, Cahyawijaya, Samuel, Yiu, Cheuk Tung, Lovenia, Holy, Dai, Wenliang, Barezi, Elham, Chen, Qifeng, Ma, Xiaojuan, Shi, Bertram, Fung, Pascale (2022) "Automatic Speech Recognition Datasets in Cantonese: A Survey and New Dataset". Link: https://arxiv.org/pdf/2201.02419.pdf
Common Voice Cantonese and zh-HK training sets

Evaluation data

Evaluation uses the Common Voice 16.0 Cantonese test set.

Comparison with Whisper Small

Metric	`alvanlii/distil-whisper-small-cantonese`	`alvanlii/whisper-small-cantonese`
Character error rate (CER, lower is better)	0.097	0.089
GPU inference time (sdpa) [s/sample]	0.027	0.055
GPU inference time (regular) [s/sample]	0.027	0.308
CPU inference time [s/sample]	1.3	2.57
Parameters [M]	157	242
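
The trade-off in the table is easier to read as ratios. The short script below derives speedups, the parameter reduction, and the CER gap from the table values only. The dictionary keys and function names are illustrative, and no new measurements are involved.

```python
# Values copied from the comparison table above.
RESULTS = {
    "distil": {"cer": 0.097, "gpu_sdpa_s": 0.027, "gpu_regular_s": 0.027,
               "cpu_s": 1.3, "params_m": 157},
    "base": {"cer": 0.089, "gpu_sdpa_s": 0.055, "gpu_regular_s": 0.308,
             "cpu_s": 2.57, "params_m": 242},
}


def speedup(metric: str) -> float:
    """How many times faster the distilled model is for a timing metric."""
    return RESULTS["base"][metric] / RESULTS["distil"][metric]


def param_reduction() -> float:
    """Fraction of parameters removed by distillation."""
    base = RESULTS["base"]["params_m"]
    return (base - RESULTS["distil"]["params_m"]) / base


def cer_gap() -> float:
    """Absolute CER increase of the distilled model over the base model."""
    return RESULTS["distil"]["cer"] - RESULTS["base"]["cer"]


if __name__ == "__main__":
    for metric in ("gpu_sdpa_s", "gpu_regular_s", "cpu_s"):
        print(f"{metric}: {speedup(metric):.2f}x faster")
    print(f"parameters removed: {param_reduction():.1%}")
    print(f"CER gap: {cer_gap() * 100:.1f} percentage points")
```

By these figures the distilled model is about 2x faster on GPU with SDPA and on CPU, about 11x faster on GPU without SDPA, and about 35% smaller, at a cost of 0.8 percentage points of CER.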