🚀 Distil-Whisper Small zh-HK - Alvin
This model is a distilled version of the Whisper small model for Cantonese, offering efficient performance with reduced computational requirements.
🚀 Quick Start
The following code shows how to use the Distil-Whisper Small zh-HK - Alvin model for automatic speech recognition:
Basic Usage
import librosa
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Load the audio and resample it to the 16 kHz rate Whisper expects
y, sr = librosa.load('audio.mp3', sr=16000)

MODEL_NAME = "alvanlii/distil-whisper-small-cantonese"
processor = WhisperProcessor.from_pretrained(MODEL_NAME)
model = WhisperForConditionalGeneration.from_pretrained(MODEL_NAME)
# Clear the forced decoder prompt and token suppression, and disable the KV cache
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []
model.config.use_cache = False

# Convert the waveform into the input features the model consumes
processed_in = processor(y, sampling_rate=sr, return_tensors="pt")
gout = model.generate(
    input_features=processed_in.input_features,
    output_scores=True, return_dict_in_generate=True
)
# Decode the generated token IDs into text
transcription = processor.batch_decode(gout.sequences, skip_special_tokens=True)[0]
print(transcription)
Advanced Usage
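from transformers import pipeline

MODEL_NAME = "alvanlii/distil-whisper-small-cantonese"
lang = "zh"
device = 0  # index of the GPU to use; set to -1 to run on CPU
pipe = pipeline(
    task="automatic-speech-recognition",
    model=MODEL_NAME,
    chunk_length_s=30,  # process long audio in 30-second chunks
    device=device,
)
# Force the decoder prompt so the model transcribes in the chosen language
# (newer transformers releases may prefer generate_kwargs={"language": lang, "task": "transcribe"} in the pipe(...) call)
pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(language=lang, task="transcribe")
file = 'audio.mp3'
text = pipe(file)["text"]
print(text)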
✨ Features
- This model is a distilled version of alvanlii/whisper-small-cantonese for Cantonese.
- Achieves a CER of 9.7% without punctuation and 11.59% with punctuation on Common Voice 16.0.
- Has 3 decoder layers instead of the 12 in the regular Whisper small model.
- Uses ~2 GB of GPU VRAM.
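The card does not show how the CER figures were computed. As a rough illustration only, the sketch below uses one common definition: character-level Levenshtein distance divided by the total number of reference characters. The helper names (`strip_punctuation`, `edit_distance`, `cer`) and the choice to filter punctuation by Unicode category are assumptions made for this example, not the author's evaluation script.

```python
import unicodedata


def strip_punctuation(text: str) -> str:
    """Remove every character whose Unicode category is punctuation (P*).

    Assumption: this is how "without punctuation" is approximated here.
    """
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(references, hypotheses, keep_punctuation=True) -> float:
    """Corpus-level CER: total edits divided by total reference characters."""
    total_edits = 0
    total_chars = 0
    for ref, hyp in zip(references, hypotheses):
        if not keep_punctuation:
            ref, hyp = strip_punctuation(ref), strip_punctuation(hyp)
        total_edits += edit_distance(ref, hyp)
        total_chars += len(ref)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    return total_edits / total_chars


if __name__ == "__main__":
    refs = ["你好，世界。"]   # hypothetical reference transcript
    hyps = ["你好世界"]       # hypothetical model output
    print(cer(refs, hyps, keep_punctuation=True))
    print(cer(refs, hyps, keep_punctuation=False))
```

A transcript that is correct apart from missing punctuation scores worse when punctuation is kept, which is consistent with the with-punctuation CER being the higher of the two figures reported above.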
📚 Documentation
Training and Evaluation Data
For training, the following datasets are used:
- CantoMap: Winterstein, Grégoire, Tang, Carmen and Lai, Regine (2020) "CantoMap: a Hong Kong Cantonese MapTask Corpus", in Proceedings of The 12th Language Resources and Evaluation Conference, Marseille: European Language Resources Association, pp. 2899-2906.
- Cantonese-ASR: Yu, Tiezheng, Frieske, Rita, Xu, Peng, Cahyawijaya, Samuel, Yiu, Cheuk Tung, Lovenia, Holy, Dai, Wenliang, Barezi, Elham, Chen, Qifeng, Ma, Xiaojuan, Shi, Bertram, Fung, Pascale (2022) "Automatic Speech Recognition Datasets in Cantonese: A Survey and New Dataset". Link: https://arxiv.org/pdf/2201.02419.pdf
- Common Voice yue and zh-HK train sets
For evaluation, the Common Voice 16.0 yue Test set is used.
Comparisons to Whisper Small
| Property | alvanlii/distil-whisper-small-cantonese | alvanlii/whisper-small-cantonese |
| --- | --- | --- |
| CER (lower is better) | 0.097 | 0.089 |
| GPU Inference time (sdpa) [s/sample] | 0.027 | 0.055 |
| GPU Inference (regular) [s/sample] | 0.027 | 0.308 |
| CPU Inference [s/sample] | 1.3 | 2.57 |
| Params [M] | 157 | 242 |
Note: inference time is the average per-sample inference time over the Common Voice 16.0 yue test set.
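The benchmarking script behind these numbers is not included in this card. The sketch below shows one plausible way to obtain an average seconds-per-sample figure and to turn the table values into speedup ratios. The helpers `mean_seconds_per_sample` and `speedup`, and the stand-in `transcribe` callable, are hypothetical names introduced for illustration.

```python
import time
from typing import Callable, Iterable


def mean_seconds_per_sample(transcribe: Callable[[object], object],
                            samples: Iterable[object]) -> float:
    """Time each call to `transcribe` and return the mean wall-clock duration."""
    durations = []
    for sample in samples:
        start = time.perf_counter()
        transcribe(sample)
        durations.append(time.perf_counter() - start)
    if not durations:
        raise ValueError("no samples to time")
    return sum(durations) / len(durations)


def speedup(baseline_seconds: float, distilled_seconds: float) -> float:
    """How many times faster the distilled model is than the baseline."""
    return baseline_seconds / distilled_seconds


if __name__ == "__main__":
    # Stand-in workload; a real benchmark would call the model here.
    avg = mean_seconds_per_sample(lambda s: sum(range(1000)), range(10))
    print(f"average: {avg:.6f} s/sample")

    # Ratios derived from the s/sample values reported in the table above.
    print(f"GPU (sdpa):    {speedup(0.055, 0.027):.2f}x")
    print(f"GPU (regular): {speedup(0.308, 0.027):.2f}x")
    print(f"CPU:           {speedup(2.57, 1.3):.2f}x")
```

When timing on a GPU, kernels run asynchronously, so a real benchmark would typically synchronize the device before reading the clock; that step is omitted here to keep the sketch dependency-free.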
📄 License
This model is licensed under the Apache-2.0 license.
Additional Information
| Property | Details |
| --- | --- |
| Model Type | Distilled Whisper Small for Cantonese |
| Training Data | CantoMap, Cantonese-ASR, Common Voice yue and zh-HK train sets |
| Evaluation Data | Common Voice 16.0 yue Test set |
| Base Model | openai/whisper-small |
| Datasets Used | mozilla-foundation/common_voice_11_0, mozilla-foundation/common_voice_16_0 |