Distil-Whisper-Small-Cantonese Open-Source Cantonese Speech Recognition Model - Free and Precise Cantonese Speech-to-Text Conversion

Distil Whisper Small Cantonese

Developed by alvanlii

This is a distilled Cantonese speech recognition model based on Whisper Small. It achieves a character error rate (CER) of 9.7% (without punctuation) on Common Voice 16.0.

Speech Recognition

Transformers

Chinese

Open Source License: Apache-2.0

#Cantonese speech recognition #Lightweight model #Low-resource inference


Release Time: 4/3/2024

Model Overview

This model is a distilled version of Whisper Small, specifically optimized for Cantonese speech recognition tasks, featuring a smaller model size and faster inference speed.

Model Features

Efficient Inference

Compared with the non-distilled alvanlii/whisper-small-cantonese model, GPU inference time per sample is roughly halved (0.027 s vs. 0.055 s with SDPA attention), and the model needs only about 2GB of GPU VRAM.

Cantonese Optimization

Specifically trained and optimized for Cantonese speech recognition tasks.

Lightweight

The model is compressed by reducing the number of decoder layers from 12 to 3, which cuts the parameter count from 242M to 157M.

Model Capabilities

Cantonese speech recognition

Speech-to-text

Audio transcription

Use Cases

Speech Transcription

Cantonese Meeting Minutes

Automatically transcribe Cantonese meeting recordings into text

Achieved a character error rate (CER) of 9.7% on the Common Voice 16.0 test set

Media Subtitle Generation

Automatically generate subtitles for Cantonese video content
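For subtitle generation, one possible approach is to ask the Transformers speech recognition pipeline for timestamps (`return_timestamps=True`) and convert the returned chunks to the SRT subtitle format. The sketch below assumes each chunk is a dict with a `"timestamp"` pair `(start, end)` in seconds and a `"text"` string; the helper names `format_srt_time` and `chunks_to_srt` are hypothetical and not part of any library.

```python
def format_srt_time(seconds: float) -> str:
    """Convert a time in seconds to the SRT form HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def chunks_to_srt(chunks) -> str:
    """Render timestamped chunks as numbered SRT entries."""
    entries = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        if end is None:
            # Assumption: a missing end time falls back to the start time
            end = start
        entries.append(
            f"{index}\n"
            f"{format_srt_time(start)} --> {format_srt_time(end)}\n"
            f"{chunk['text'].strip()}\n"
        )
    return "\n".join(entries)
```

With a pipeline object `pipe`, the input would typically come from `pipe(file, return_timestamps=True)["chunks"]`.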

🚀 Distil-Whisper Small zh-HK - Alvin

This model is a distilled version of the Whisper small model for Cantonese, offering efficient performance with reduced computational requirements.

🚀 Quick Start

The following code demonstrates how to use the Distil-Whisper Small zh-HK - Alvin model for automatic speech recognition:

Basic Usage

import librosa
from transformers import WhisperForConditionalGeneration, WhisperProcessor

MODEL_NAME = "alvanlii/distil-whisper-small-cantonese"

# Load the audio and resample it to the 16 kHz rate that Whisper expects
y, sr = librosa.load('audio.mp3', sr=16000)

processor = WhisperProcessor.from_pretrained(MODEL_NAME)
model = WhisperForConditionalGeneration.from_pretrained(MODEL_NAME)

# Clear forced decoder prompts and token suppression, and disable the KV cache
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []
model.config.use_cache = False

# Convert the waveform to model input features
processed_in = processor(y, sampling_rate=sr, return_tensors="pt")
gout = model.generate(
    input_features=processed_in.input_features,
    output_scores=True, return_dict_in_generate=True
)
# Decode the generated token IDs to text
transcription = processor.batch_decode(gout.sequences, skip_special_tokens=True)[0]
print(transcription)

Advanced Usage

from transformers import pipeline

MODEL_NAME = "alvanlii/distil-whisper-small-cantonese"
lang = "zh"
device = 0  # GPU index, e.g. 0 for the first GPU; use -1 for CPU

# chunk_length_s=30 lets the pipeline process long recordings in 30-second chunks
pipe = pipeline(
    task="automatic-speech-recognition",
    model=MODEL_NAME,
    chunk_length_s=30,
    device=device,
)
# Force the decoder to transcribe in the given language
pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(language=lang, task="transcribe")

file = 'audio.mp3'  # path to a valid audio file
text = pipe(file)["text"]
print(text)
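The `chunk_length_s=30` argument makes the pipeline work through long recordings in 30-second windows. The sketch below only illustrates the idea of overlapping windows; it is not the pipeline's internal implementation, and the function name `chunk_bounds` and the 5-second `overlap_s` default are assumptions made for this example.

```python
def chunk_bounds(total_s: float, chunk_s: float = 30.0, overlap_s: float = 5.0):
    """Return (start, end) times in seconds for overlapping windows.

    Illustrative only: the overlap value is an assumption, not the
    stride the Transformers pipeline actually uses.
    """
    if not 0 <= overlap_s < chunk_s:
        raise ValueError("overlap_s must be in [0, chunk_s)")
    step = chunk_s - overlap_s
    bounds = []
    start = 0.0
    while start < total_s:
        end = min(start + chunk_s, total_s)
        bounds.append((start, end))
        if end >= total_s:
            break  # the last window reached the end of the recording
        start += step
    return bounds


# A 70-second recording is covered by three windows
print(chunk_bounds(70.0))
```

The overlap gives each window some shared context with its neighbours, which helps when a word is cut at a window boundary.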

✨ Features

This model is a distilled version of alvanlii/whisper-small-cantonese for Cantonese.
Achieves a CER of 9.7% without punctuation and 11.59% with punctuation on Common Voice 16.0.
Has 3 decoder layers instead of the 12 in the standard Whisper small model.
Uses ~2GB of GPU VRAM.
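As a rough sanity check, the drop from 12 to 3 decoder layers can be related to the drop from 242M to 157M parameters with some back-of-the-envelope arithmetic. The sketch below assumes the Whisper small dimensions (hidden size 768, feed-forward size 3072) and a standard Transformer decoder layer with self-attention, cross-attention and a feed-forward block; the bias layout (no bias on the key projection) is an assumption based on the Hugging Face implementation. It is an estimate, not a measurement of the released checkpoint.

```python
D_MODEL = 768   # hidden size of Whisper small
FFN_DIM = 3072  # feed-forward size of Whisper small (4 * D_MODEL)


def attention_params(d_model: int) -> int:
    # Query, key, value and output projection weights, plus biases on
    # three of them (assumption: the key projection has no bias)
    return 4 * d_model * d_model + 3 * d_model


def decoder_layer_params(d_model: int, ffn_dim: int) -> int:
    self_attn = attention_params(d_model)
    cross_attn = attention_params(d_model)
    # Two linear layers with biases
    ffn = d_model * ffn_dim + ffn_dim + ffn_dim * d_model + d_model
    # Three layer norms, each with a weight and a bias vector
    layer_norms = 3 * 2 * d_model
    return self_attn + cross_attn + ffn + layer_norms


removed_layers = 12 - 3
removed_params = removed_layers * decoder_layer_params(D_MODEL, FFN_DIM)
reported_reduction = (242 - 157) * 1_000_000

print(f"estimated parameters removed: {removed_params / 1e6:.1f}M")
print(f"reported reduction:           {reported_reduction / 1e6:.1f}M")
```

Under these assumptions the nine removed layers account for about 85M parameters, which matches the reported difference and corresponds to a reduction of roughly 35%.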

📚 Documentation

Training and Evaluation Data

For training, the following datasets are used:

CantoMap: Winterstein, Grégoire, Tang, Carmen and Lai, Regine (2020) "CantoMap: a Hong Kong Cantonese MapTask Corpus", in Proceedings of The 12th Language Resources and Evaluation Conference, Marseille: European Language Resources Association, pp. 2899-2906.
Cantonese-ASR: Yu, Tiezheng, Frieske, Rita, Xu, Peng, Cahyawijaya, Samuel, Yiu, Cheuk Tung, Lovenia, Holy, Dai, Wenliang, Barezi, Elham, Chen, Qifeng, Ma, Xiaojuan, Shi, Bertram, Fung, Pascale (2022) "Automatic Speech Recognition Datasets in Cantonese: A Survey and New Dataset". Link: https://arxiv.org/pdf/2201.02419.pdf
Common Voice yue and zh-HK train sets

For evaluation, the Common Voice 16.0 yue Test set is used.
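The character error rate reported for this model is the character-level edit distance between the reference and the predicted transcript, divided by the reference length. A minimal sketch of that metric follows, with an option to drop punctuation before scoring. The exact text normalization used for the reported numbers is not stated here, so the Unicode-category punctuation filter and the function names are assumptions.

```python
import unicodedata


def strip_punctuation(text: str) -> str:
    """Remove characters whose Unicode category is punctuation (P*)."""
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str, keep_punctuation: bool = True) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not keep_punctuation:
        reference = strip_punctuation(reference)
        hypothesis = strip_punctuation(hypothesis)
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Scoring with `keep_punctuation=False` ignores punctuation mistakes, which is why the punctuation-free CER is the lower of the two reported figures.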

Comparisons to Whisper Small

Property	`alvanlii/distil-whisper-small-cantonese`	`alvanlii/whisper-small-cantonese`
CER (lower is better)	0.097	0.089
GPU Inference time (sdpa) [s/sample]	0.027	0.055
GPU Inference time (regular) [s/sample]	0.027	0.308
CPU Inference time [s/sample]	1.3	2.57
Params [M]	157	242

Note: inference time is the average time per sample over the Common Voice 16.0 yue test set.
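The per-sample averaging described in the note can be sketched as follows. This is not the script used to produce the table; the function name `average_seconds_per_sample`, the `transcribe` callable and the warm-up step are assumptions made for illustration.

```python
import time


def average_seconds_per_sample(transcribe, samples, warmup: int = 1) -> float:
    """Average wall-clock time of transcribe(sample) over all samples.

    `transcribe` is any callable that returns only once the result is
    ready, for example a function that returns the decoded text.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    # Warm-up calls are excluded from timing (assumption: the first
    # call may include one-off setup cost)
    for sample in samples[:warmup]:
        transcribe(sample)
    start = time.perf_counter()
    for sample in samples:
        transcribe(sample)
    elapsed = time.perf_counter() - start
    return elapsed / len(samples)
```

With a pipeline object `pipe`, a call such as `average_seconds_per_sample(lambda path: pipe(path)["text"], audio_paths)` would give a figure comparable in spirit to the table, where `audio_paths` is a list of audio files.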

📄 License

This model is licensed under the Apache-2.0 license.

Additional Information

Property	Details
Model Type	Distilled Whisper Small for Cantonese
Training Data	CantoMap, Cantonese-ASR, Common Voice yue and zh-HK train sets
Evaluation Data	Common Voice 16.0 yue Test set
Base Model	openai/whisper-small
Datasets Used	mozilla-foundation/common_voice_11_0, mozilla-foundation/common_voice_16_0
