Whisper-large-v3-Cantonese Open-source Model - Free Automatic Cantonese Speech Recognition

Whisper Large V3 Cantonese

Developed by khleeloo

A Cantonese automatic speech recognition model fine-tuned on Whisper v3, trained on the Common Voice 17 dataset

Speech Recognition

Transformers

Open Source License: Apache-2.0
Tags: Cantonese speech recognition, Whisper fine-tuning, low character error rate

Release Time: 12/4/2024

Model Overview

This model is a fine-tuned version of Whisper v3, trained specifically for Cantonese (Yue) automatic speech recognition (ASR). It is suitable for applications such as voice assistants and transcription services.

Model Features

Cantonese speech recognition

Speech recognition capabilities optimized specifically for Cantonese

Whisper v3 architecture

Based on OpenAI's Whisper large-v3 model architecture

Efficient fine-tuning

Fine-tuned for 10 epochs on the Common Voice 17 dataset

Model Capabilities

Cantonese speech-to-text

Automatic speech recognition

Speech transcription

Use Cases

Voice assistants

Cantonese voice assistant

Provides voice interaction functionality for Cantonese users

Transcription services

Cantonese speech transcription

Converts Cantonese speech content into text

Accessibility features

Cantonese accessibility services

Provides speech-to-text accessibility features for Cantonese speakers

🚀 Whisper Cantonese Model

This is a fine-tuned Whisper v3 model for automatic speech recognition in Cantonese (Yue). It reaches a character error rate of 7.26 on the Common Voice 17 yue test split.

🚀 Quick Start

To use this model, you can load it using the Hugging Face Transformers library:

from transformers import WhisperProcessor, WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("khleeloo/whisper-large-v3-cantonese")
processor = WhisperProcessor.from_pretrained("khleeloo/whisper-large-v3-cantonese")

✨ Features

Specifically fine-tuned for Cantonese (Yue) automatic speech recognition.
Trained on the Common Voice 17 dataset for 10 epochs with a learning rate of 1e-7.
Can be used in various applications such as voice assistants, transcription services, and accessibility features for Cantonese speakers.

📦 Installation

This model can be loaded using the Hugging Face Transformers library. Ensure you have the transformers library installed in your Python environment. You can install it via the following command:

pip install transformers

💻 Usage Examples

Basic Usage

import librosa  # assumption: any loader that returns a 16 kHz mono float array works
from transformers import WhisperProcessor, WhisperForConditionalGeneration

model_id = "khleeloo/whisper-large-v3-cantonese"
model = WhisperForConditionalGeneration.from_pretrained(model_id)
processor = WhisperProcessor.from_pretrained(model_id)

# Load 'audio.wav' as a mono waveform resampled to 16 kHz, the rate Whisper expects
audio, sampling_rate = librosa.load("audio.wav", sr=16000)

# Convert the waveform into the model's input features
input_features = processor(audio, sampling_rate=sampling_rate, return_tensors="pt").input_features

# Generate token IDs and decode them into text
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription[0])
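
The usage example needs the audio as a mono waveform sampled at 16 kHz, which is the rate Whisper's feature extractor expects. If you would rather avoid an extra audio dependency, the following is a minimal sketch using only the Python standard library. It assumes a 16-bit PCM WAV file; the helper name `load_wav_mono` and its `expected_rate` parameter are illustrative and not part of the model card or of Transformers.

```python
import struct
import wave


def load_wav_mono(path, expected_rate=16_000):
    """Read a 16-bit PCM WAV file and return mono samples as floats in [-1.0, 1.0).

    Hypothetical helper: it does not resample, it only checks the rate.
    """
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM WAV files")
        rate = wav.getframerate()
        channels = wav.getnchannels()
        frames = wav.readframes(wav.getnframes())

    if rate != expected_rate:
        raise ValueError(f"expected {expected_rate} Hz audio, got {rate} Hz; resample first")

    # Unpack little-endian signed 16-bit integers (channels are interleaved).
    ints = struct.unpack(f"<{len(frames) // 2}h", frames)

    # Average the channels of each frame into one mono sample and scale to floats.
    return [
        sum(ints[i:i + channels]) / channels / 32768.0
        for i in range(0, len(ints), channels)
    ]
```

The returned list can be passed to the processor as the audio argument, together with `sampling_rate=16000`.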

📚 Documentation

Model Details

Property	Details
Model Type	Whisper v3
Language	Cantonese (Yue)
Training Data	Common Voice 17
Training Duration	10 epochs
Learning Rate	1e-7
Frozen Layers	12 layers in the decoder are frozen during training
Developed by	khleeloo (Rita Frieske)
License	Apache-2.0
Finetuned from model	openai/whisper-large-v3

Uses

This model is intended for researchers and developers interested in building applications that require speech recognition capabilities in Cantonese. It can be used in various applications, including voice assistants, transcription services, and accessibility features for Cantonese speakers.

Bias, Risks, and Limitations

⚠️ Important Note

The model is fine-tuned specifically for Cantonese and may not perform well on other languages or dialects. Performance may vary with the audio quality and the speaker's accent. The model's effectiveness depends on the diversity and richness of the training data.

Training

Training Data

mozilla-foundation/common_voice_17_0

Evaluation

Testing Data, Factors & Metrics

Common Voice 17_0 yue test split, Common Voice 15_0 yue test split, and Common Voice 15_0 zh-HK test split (these test datasets were used to evaluate Whisper large v3).

Metrics

Character Error Rate (CER), since Cantonese is written with characters rather than space-delimited words. Lower is better.
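
For readers unfamiliar with the metric, CER is commonly computed as the character-level edit distance between the reference and the hypothesis, divided by the number of reference characters. The following is a minimal sketch of that standard definition. The function names are illustrative, and the text normalization the author applied before scoring is not stated in this card, so the whitespace stripping here is an assumption.

```python
def levenshtein(ref, hyp):
    """Minimum number of insertions, deletions, and substitutions turning ref into hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate as a fraction; multiply by 100 for a percentage."""
    # Assumption: whitespace is ignored, since Chinese-script text has no word spacing.
    ref = "".join(reference.split())
    hyp = "".join(hypothesis.split())
    if not ref:
        raise ValueError("reference must contain at least one character")
    return levenshtein(ref, hyp) / len(ref)


# One substituted character out of seven reference characters.
print(round(cer("我哋今日去飲茶", "我地今日去飲茶"), 4))  # 0.1429
```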

Results

Model	CV 15_0 zh-HK	CV 15_0 yue	CV 17_0 yue
Whisper large v3	10.8	16	-
Whisper Cantonese (ours)	18.88	8.77	7.26

Explanation: our model was trained on the more vernacular Cantonese data (yue) rather than on the zh-HK data, which contains more written Cantonese, since it is a speech recognition model. This explains its weaker performance on the zh-HK split of the Common Voice dataset.
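
To put the reported numbers in perspective, the relative change in CER against the Whisper large v3 baseline can be worked out directly from the results. This is a small arithmetic sketch using only the figures reported above; the CV 17_0 yue split is left out because no baseline score is given for it.

```python
def relative_change(baseline, finetuned):
    """Relative CER change in percent; negative values mean the fine-tuned model is better."""
    return (finetuned - baseline) / baseline * 100.0


# (baseline CER, fine-tuned CER) taken from the results reported in this card
results = {
    "CV 15_0 zh-HK": (10.8, 18.88),
    "CV 15_0 yue": (16.0, 8.77),
}

for split, (baseline, finetuned) in results.items():
    print(f"{split}: {relative_change(baseline, finetuned):+.1f}%")
# CV 15_0 zh-HK: +74.8%
# CV 15_0 yue: -45.2%
```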

🔧 Technical Details

This model is a fine-tuned version of the Whisper v3 model for Cantonese (Yue) automatic speech recognition. It was fine-tuned on the Common Voice 17 dataset for 10 epochs with a learning rate of 1e-7. During training, 12 layers in the decoder were frozen.
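
Freezing decoder layers can be sketched as below. This is not the author's training code: the card does not say which 12 decoder layers were frozen, so freezing the first 12 is an assumption, and `freeze_decoder_layers` is a hypothetical helper. It relies on the decoder blocks being reachable at `model.model.decoder.layers`, as they are on a Transformers `WhisperForConditionalGeneration` instance.

```python
def freeze_decoder_layers(model, num_layers=12):
    """Disable gradients for the first `num_layers` decoder blocks.

    Returns the number of parameter tensors that were frozen.
    Assumption: the frozen layers are the first ones; the card does not specify.
    """
    layers = model.model.decoder.layers
    if num_layers > len(layers):
        raise ValueError(f"decoder has only {len(layers)} layers")

    frozen = 0
    for layer in layers[:num_layers]:
        for param in layer.parameters():
            param.requires_grad = False  # excluded from gradient updates
            frozen += 1
    return frozen
```

With a model loaded through `WhisperForConditionalGeneration.from_pretrained(...)`, calling `freeze_decoder_layers(model)` before building the optimizer would leave the encoder and the remaining decoder blocks trainable.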

📄 License

This model is released under the Apache-2.0 license.

📚 Citation

BibTeX:

@misc {rita_frieske_2025,
	author       = { {Rita Frieske} },
	title        = { whisper-large-v3-cantonese },
	year         = 2025,
	url          = { https://huggingface.co/khleeloo/whisper-large-v3-cantonese },
	doi          = { 10.57967/hf/4393 },
	publisher    = { Hugging Face }
}

Model Card Authors

https://khleeloo.github.io/
