Whisper-small-cantonese Open-source Cantonese Speech Recognition Model - Free Deployment for Accurate Cantonese Recognition

Whisper Small Cantonese

Developed by alvanlii

A Cantonese speech recognition model fine-tuned from OpenAI Whisper-small, achieving a character error rate (CER) of 7.93% (punctuation removed) on the Common Voice 16.0 test set

Speech Recognition

Transformers

Supports Multiple Languages

Open Source License: Apache-2.0

#Cantonese Speech Recognition #Low CER #Fast Inference

Release Time: 12/8/2022

Model Overview

An automatic speech recognition model optimized for Cantonese, supporting efficient and accurate Cantonese speech-to-text conversion

Model Features

Optimized Cantonese Recognition

Fine-tuned specifically for Cantonese, achieving a character error rate (CER) of 7.93% with punctuation removed

Efficient Inference

Supports Flash Attention acceleration, processing a single sample in 0.055 seconds on an RTX 3090 GPU

Multi-format Support

Provides GGML and CT2 formats, compatible with tools like Whisper.cpp and WhisperX

Speculative Decoding Support

Can serve as an auxiliary model to accelerate the inference process of larger models

Model Capabilities

Cantonese Speech Recognition

Chinese Speech Recognition

Fast Speech-to-Text Conversion

Long Audio Processing (supports chunking)

Use Cases

Speech Transcription

Cantonese Video Subtitle Generation

Automatically generates accurate subtitles for Cantonese video content

Recognition accuracy: CER of 7.93% (punctuation removed)

Voice Assistant

Builds Cantonese-supported voice interaction applications

Fast response (0.055 seconds/sample)

Speech Analysis

Cantonese Speech Data Analysis

Transcribes and analyzes Cantonese speech content

Supports multiple Cantonese dataset formats

🚀 Whisper Small Cantonese - Alvin

This model is a fine-tuned version of openai/whisper-small for Cantonese. It provides automatic speech recognition for Cantonese, achieving a CER of 7.93% (without punctuation) and 9.72% (with punctuation) on Common Voice 16.0.

✨ Features

Fine-Tuned for Cantonese: Based on the openai/whisper-small model and optimized for Cantonese.
High Accuracy: Achieves a CER of 7.93% (punctuation removed) on the Common Voice 16.0 yue test set.
Multiple Optimization Methods: Supports Flash Attention and Speculative Decoding for speedup.

📦 Installation

The original model card does not give installation steps. The examples below import librosa, torch, and transformers, so these packages need to be installed first.

💻 Usage Examples

Basic Usage

import librosa
from transformers import WhisperForConditionalGeneration, WhisperProcessor

MODEL_NAME = "alvanlii/whisper-small-cantonese"

# Load the audio and resample it to 16 kHz, the rate Whisper expects
y, sr = librosa.load('audio.mp3', sr=16000)

processor = WhisperProcessor.from_pretrained(MODEL_NAME)
model = WhisperForConditionalGeneration.from_pretrained(MODEL_NAME)

# Convert the waveform into the model's input features
processed_in = processor(y, sampling_rate=sr, return_tensors="pt")
gout = model.generate(
    input_features=processed_in.input_features,
    output_scores=True, return_dict_in_generate=True
)
# Decode the generated token IDs back into text
transcription = processor.batch_decode(gout.sequences, skip_special_tokens=True)[0]
print(transcription)

Advanced Usage

from transformers import pipeline

MODEL_NAME = "alvanlii/whisper-small-cantonese"
lang = "zh"
device = 0  # GPU index; adjust for your environment (for example "cpu" when no GPU is available)
pipe = pipeline(
    task="automatic-speech-recognition",
    model=MODEL_NAME,
    chunk_length_s=30,  # split long audio into 30-second chunks
    device=device,
)
# Force the decoder to transcribe in the given language
pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(language=lang, task="transcribe")
text = pipe('audio.mp3')["text"]
print(text)
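To illustrate what chunking long audio involves, here is a minimal sketch that computes the sample ranges of overlapping 30-second windows. It is a simplified illustration, not the pipeline's actual implementation: the function name `chunk_spans` and the `overlap_s` value are assumptions made for this example, and the real pipeline also has to merge the text of overlapping chunks.

```python
def chunk_spans(num_samples, sample_rate=16000, chunk_length_s=30.0, overlap_s=5.0):
    """Return (start, end) sample indices of overlapping fixed-length chunks.

    overlap_s is an illustrative value, not a documented default.
    """
    chunk = int(chunk_length_s * sample_rate)
    step = chunk - int(overlap_s * sample_rate)
    if step <= 0:
        raise ValueError("overlap_s must be smaller than chunk_length_s")

    spans = []
    start = 0
    while True:
        end = min(start + chunk, num_samples)
        spans.append((start, end))
        if end == num_samples:
            break  # the last chunk reaches the end of the audio
        start += step
    return spans


# A 70-second recording sampled at 16 kHz
spans = chunk_spans(70 * 16000)
for start, end in spans:
    print(f"{start / 16000:.0f}s to {end / 16000:.0f}s")
```

Each chunk would then be transcribed separately, with the overlap giving the decoder some context at chunk boundaries.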

📚 Documentation

Training and evaluation data

For training, the following datasets are used:

CantoMap: Winterstein, Grégoire, Tang, Carmen and Lai, Regine (2020) "CantoMap: a Hong Kong Cantonese MapTask Corpus", in Proceedings of The 12th Language Resources and Evaluation Conference, Marseille: European Language Resources Association, pp. 2899-2906.
Cantonese-ASR: Yu, Tiezheng, Frieske, Rita, Xu, Peng, Cahyawijaya, Samuel, Yiu, Cheuk Tung, Lovenia, Holy, Dai, Wenliang, Barezi, Elham, Chen, Qifeng, Ma, Xiaojuan, Shi, Bertram, Fung, Pascale (2022) "Automatic Speech Recognition Datasets in Cantonese: A Survey and New Dataset". Link: https://arxiv.org/pdf/2201.02419.pdf

Name	# of Hours
Common Voice 16.0 zh-HK Train	138
Common Voice 16.0 yue Train	85
Common Voice 17.0 yue Train	178
Cantonese-ASR	72
CantoMap	23
Pseudo-Labelled YouTube Data	438

For evaluation, the Common Voice 16.0 yue Test set is used.
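As a quick check on the scale of the training data, the hours listed in the table above can be summed. This is plain arithmetic on the listed rows. The table does not say whether any of the sets overlap, so the total is the sum of the rows rather than a count of unique audio.

```python
# Hours per training dataset, copied from the table above
training_hours = {
    "Common Voice 16.0 zh-HK Train": 138,
    "Common Voice 16.0 yue Train": 85,
    "Common Voice 17.0 yue Train": 178,
    "Cantonese-ASR": 72,
    "CantoMap": 23,
    "Pseudo-Labelled YouTube Data": 438,
}

total_hours = sum(training_hours.values())
pseudo_share = training_hours["Pseudo-Labelled YouTube Data"] / total_hours

print(f"Total listed hours: {total_hours}")
print(f"Pseudo-labelled share: {pseudo_share:.1%}")
```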

Results

CER (lower is better): 0.0972, down from 0.1073 and 0.1581 in the previous versions
CER (punctuation removed): 0.0793
GPU inference with Flash Attention (example below): 0.055s/sample
GPU inference: 0.308s/sample
CPU inference: 2.57s/sample
GPU VRAM: ~1.5 GB
Note: all GPU evaluations were done on an RTX 3090 GPU.
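For readers unfamiliar with the metric, the following is a minimal sketch of how a character error rate can be computed, with and without punctuation. It is not the evaluation script behind the numbers above: the exact text normalization used for those results is not stated, so the punctuation rule here (dropping characters in Unicode punctuation categories) is an assumption, and the function names are made up for this example.

```python
import unicodedata


def strip_punctuation(text):
    # Assumption: "punctuation" means any character in a Unicode P* category
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))


def edit_distance(ref, hyp):
    # Levenshtein distance over characters, computed row by row
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def cer(references, hypotheses, remove_punctuation=False):
    # Total character edits divided by total reference characters
    edits = chars = 0
    for ref, hyp in zip(references, hypotheses):
        if remove_punctuation:
            ref, hyp = strip_punctuation(ref), strip_punctuation(hyp)
        edits += edit_distance(ref, hyp)
        chars += len(ref)
    return edits / chars


refs = ["我哋今日去飲茶。"]
hyps = ["我地今日去飲茶"]
print(cer(refs, hyps))                           # one substitution + one deletion over 8 characters
print(cer(refs, hyps, remove_punctuation=True))  # one substitution over 7 characters
```

In this toy pair, the missing full stop counts as an error only when punctuation is kept, which is why the punctuation-removed CER is lower.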

Model Speedup

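Pass attn_implementation="sdpa" when loading the model. This selects PyTorch's scaled dot-product attention, which can dispatch to a Flash Attention kernel on supported GPUs.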

from transformers import AutoModelForSpeechSeq2Seq
import torch

torch_dtype = torch.float16  # You may need to adjust this according to your environment
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "alvanlii/whisper-small-cantonese",
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    attn_implementation="sdpa",
)

Using Flash Attention reduced the time taken per sample from 0.308s to 0.055s, a speedup of roughly 5.6x.
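For context, "sdpa" refers to scaled dot-product attention, softmax(QKᵀ/√d)V. The sketch below is a plain-Python reference of that formula on nested lists, written only to show what is being computed. Flash Attention produces the same result with a fused, memory-efficient GPU kernel; nothing here reflects how PyTorch implements it, and the function name is chosen for this example.

```python
import math


def scaled_dot_product_attention(queries, keys, values):
    """Reference attention on nested lists: softmax(Q K^T / sqrt(d)) V."""
    d = len(queries[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        # Numerically stable softmax over the scores
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted average of the value vectors
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# A zero query attends equally to both keys, so the output is the mean of the values
out = scaled_dot_product_attention([[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[2.0], [4.0]])
print(out)
```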

Speculative Decoding

A larger model can do the transcription while alvanlii/whisper-small-cantonese acts as the assistant (draft) model. This speeds up inference with almost no loss in accuracy.

from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
import torch

torch_dtype = torch.float16  # You may need to adjust this according to your environment
device = 0  # You may need to adjust this according to your environment

model_id = "simonl0909/whisper-large-v2-cantonese"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    attn_implementation="sdpa",
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

assistant_model_id = "alvanlii/whisper-small-cantonese"

assistant_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    assistant_model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    attn_implementation="sdpa",
)

assistant_model.to(device)
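# Prepare the inputs from a 16 kHz audio file, matching the model's device and dtype
import librosa

y, sr = librosa.load('audio.mp3', sr=16000)
inputs = processor(y, sampling_rate=sr, return_tensors="pt").to(device, dtype=torch_dtype)

# The small model drafts tokens and the large model verifies them
output_ids = model.generate(**inputs, use_cache=True, assistant_model=assistant_model)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])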

On its own, simonl0909/whisper-large-v2-cantonese runs at 0.714s/sample with a CER of 7.65%. With speculative decoding using alvanlii/whisper-small-cantonese as the assistant, it runs at 0.137s/sample with a CER of 7.67%, which is about 5.2 times faster.
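The idea behind speculative decoding can be shown with a toy, greedy-only sketch: a cheap draft model proposes a few tokens, and the large target model checks them, keeping the longest matching prefix plus one token of its own. This is not the Transformers implementation. `target_next` and `draft_next` are hypothetical stand-ins for the large and small models, and the loop calls the target once per position for clarity, whereas a real model scores all drafted positions in a single forward pass (counted here as `target_passes`).

```python
def greedy_decode(next_token, prompt, max_new_tokens):
    # Baseline: one model call per generated token
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(next_token(tokens))
    return tokens


def speculative_greedy_decode(target_next, draft_next, prompt, max_new_tokens, k=4):
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    target_passes = 0
    while len(tokens) < limit:
        # 1) The draft model proposes k tokens autoregressively
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2) The target model verifies the proposal (one pass in a real model)
        target_passes += 1
        accepted = []
        for i in range(k + 1):
            expected = target_next(tokens + accepted)
            accepted.append(expected)  # always keep the target's own choice
            if i == k or draft[i] != expected:
                break  # first mismatch, or all k accepted plus one bonus token
        tokens.extend(accepted)
    return tokens[:limit], target_passes


# Toy models: the target counts upward modulo 10; the draft agrees except after a 5
def target_next(tokens):
    return (tokens[-1] + 1) % 10


def draft_next(tokens):
    return 0 if tokens[-1] == 5 else (tokens[-1] + 1) % 10


reference = greedy_decode(target_next, [0], 8)
output, passes = speculative_greedy_decode(target_next, draft_next, [0], 8)
print(output == reference, passes)  # True 3
```

In this toy setup the output matches plain greedy decoding with the target model, while the target runs 3 verification passes instead of 8 single-token steps. The speedup in practice depends on how often the draft model agrees with the target.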

Whisper.cpp

A GGML bin file for Whisper.cpp was uploaded in June 2024. The original model card links to the bin file download and to a demo where it can be tried out.

Whisper CT2

WhisperX and FasterWhisper need a model in CT2 format. The original model card links to the converted model.

Training Hyperparameters

Property	Details
learning_rate	5e-5
train_batch_size	25 (on one RTX 3090 GPU)
eval_batch_size	8
gradient_accumulation_steps	4
total_train_batch_size	25 x 4 = 100
optimizer	Adam with betas=(0.9, 0.999) and epsilon=1e-08
lr_scheduler_type	linear
lr_scheduler_warmup_steps	500
training_steps	15000
augmentation	None
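The hyperparameters above imply an effective batch size and a learning-rate curve that can be worked out directly. The sketch below assumes the "linear" scheduler means linear warmup from zero followed by linear decay to zero at the final step, which is the usual behaviour of a linear schedule with warmup; the model card does not spell this out, and the function name is made up for this example.

```python
BASE_LR = 5e-5
WARMUP_STEPS = 500
TRAINING_STEPS = 15000
TRAIN_BATCH_SIZE = 25
GRAD_ACCUM_STEPS = 4


def linear_schedule_lr(step, base_lr=BASE_LR, warmup_steps=WARMUP_STEPS, total_steps=TRAINING_STEPS):
    # Assumed shape: linear warmup to base_lr, then linear decay to zero
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


effective_batch = TRAIN_BATCH_SIZE * GRAD_ACCUM_STEPS   # samples per optimizer step
samples_processed = effective_batch * TRAINING_STEPS    # counts repeated passes over the data

print(effective_batch, samples_processed)
for step in (0, 250, 500, 7750, 15000):
    print(step, linear_schedule_lr(step))
```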

📄 License

This model is licensed under the Apache-2.0 license.
