Whisper-finetune-Teochew Open-source Model - Supports Orthographic Transcription of Teochew with Multi-dialect Accents

Whisper Finetune Teochew

Developed by panlr

A Teochew (Chaoshan) speech recognition model fine-tuned from Whisper-medium. It transcribes speech in Teochew orthography and supports multiple regional accents.

Speech Recognition

Safetensors

Chinese #Teochew speech recognition #Dai Kan orthography #Multi-accent support

Downloads 20

Release Time : 3/17/2025

Model Overview

This model is designed for automatic speech recognition of the Teochew (Chaoshan) dialect. Its transcripts follow the Dai Kan orthography (歹看正字法), an annotation scheme created for the training dataset to avoid homophone ambiguity.

Model Features

Multi-dialect support

Covers several accents, including those of the ancient city of Chaozhou, Shantou urban area, southern Chao'an, and Chenghai, as well as the Rongjiang accent

Dai Kan orthography

Uses a purpose-built annotation scheme to resolve homophone ambiguity (e.g., writing 【介】 instead of the easily confused 【个】)

Field recording data

Fine-tuned on data drawn from approximately 18.9 hours of in-the-wild recordings, comprising 12,500 annotated segments

Model Capabilities

Teochew speech-to-text

Multi-accent recognition

Orthographic transcription

Use Cases

Dialect preservation

Teochew speech archiving

Converting orally transmitted Teochew recordings into standardized written records

CER 12.254% (test set)

Voice interaction

Dialect voice assistant

Supporting Teochew voice input for smart device interaction

🚀 Teochew Whisper Fine-tuned Model

This is a fine-tuned version of Whisper-medium for accurate orthographic recognition of the Teochew dialect. It transcribes Teochew as spoken and does not translate it into Mandarin.

🚀 Quick Start

The model is fine-tuned from Whisper-medium and outputs Teochew orthography rather than a Mandarin translation. The fine-tuning code comes from yeyupiaoling's GitHub repository.
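As a starting point, the sketch below shows one way the model might be loaded for transcription with the Hugging Face `transformers` pipeline. It rests on assumptions not stated on this page: that the weights are published in standard Transformers format, and that the repository ID is `panlr/whisper-finetune-teochew` (inferred from the model name and developer). The function name `transcribe` is hypothetical.

```python
# Assumption: this repository ID is inferred from the model name and the
# developer shown on this page. Adjust it if the actual ID differs.
MODEL_ID = "panlr/whisper-finetune-teochew"


def transcribe(audio_path: str, model_id: str = MODEL_ID) -> str:
    """Return the Teochew transcript of a single audio file."""
    # Imported inside the function so this module can be loaded even when
    # the heavy transformers dependency is not installed.
    from transformers import pipeline

    # Build a speech recognition pipeline around the fine-tuned checkpoint.
    asr = pipeline("automatic-speech-recognition", model=model_id)

    # The pipeline returns a dict whose "text" field holds the transcript.
    result = asr(audio_path)
    return result["text"]
```

Calling `transcribe("sample.wav")`, where `sample.wav` is a placeholder path, downloads the checkpoint on first use. Decoding audio from a file path typically requires `ffmpeg` to be installed.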

🔗 Online Demo

teochew_whisper

📦 Fine-tuning Data

The fine-tuning data comes from teochew-wild, the first open-source, in-the-wild, multi-speaker Teochew dataset with accurate orthographic annotation. It contains approximately 18.9 hours of audio in 12,500 segments and covers several accents, including those of the ancient city of Chaozhou, Shantou urban area, southern Chao'an, and Chenghai, as well as the Rongjiang accent.

To reduce problems such as ambiguous written forms, excessive polyphonic characters, and interchangeable variant characters, the dataset is annotated in the self-created Dai Kan orthography (歹看正字法) rather than with the commonly used homophonic characters or the original characters verified by experts.

Both the homophonic-character approach and the expert-verified approach easily produce ambiguity. For example:

If 【个】 is used to represent 【的】, then 【有个人】 could mean either 【有一个人】 or 【有的人】. This dataset therefore uses 【介】 instead of 【个】.
If 【只】 is used to represent 【这】, then 【这只猫】 and 【这只车】 would be written as 【只只猫】 and 【只只车】, which reads awkwardly. This dataset therefore uses the variant character 【祇】 to represent 【这】; other cases follow Mandarin usage.

📊 Evaluation Results

I randomly divided the 12,500 data samples into a training set, a validation set, and a test set, with 11,000, 700, and 700 samples respectively. The model was fine-tuned for approximately 10 epochs on an RTX 3090 and evaluated with the Character Error Rate (CER). (The experiments in the paper unified some homophonic characters in the labels, such as 【仔】 and 【囝】, or 【二】 and 【两】, and therefore obtained better results.) The results are as follows:

Data Subset	CER (%)
Validation Set	12.865
Test Set	12.254
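For reference, the metric above can be sketched as follows. This is a minimal illustration of the common corpus-level definition of CER (total character edits divided by total reference characters), not the author's actual scoring script, whose preprocessing is not described here. The optional `unify` mapping mimics the label unification mentioned for the paper; the direction of the mapping (for example 【囝】 to 【仔】) is an assumption made only for illustration.

```python
from typing import Dict, Optional, Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference char
                            curr[j - 1] + 1,      # insert a hypothesis char
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalize(text: str, unify: Optional[Dict[str, str]] = None) -> str:
    """Map interchangeable characters to one form (mapping is hypothetical)."""
    if not unify:
        return text
    return "".join(unify.get(ch, ch) for ch in text)


def cer(references: Sequence[str], hypotheses: Sequence[str],
        unify: Optional[Dict[str, str]] = None) -> float:
    """Corpus-level character error rate, in percent."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have equal length")
    edits = 0
    chars = 0
    for ref, hyp in zip(references, hypotheses):
        ref, hyp = normalize(ref, unify), normalize(hyp, unify)
        edits += edit_distance(ref, hyp)
        chars += len(ref)
    if chars == 0:
        raise ValueError("references contain no characters")
    return 100.0 * edits / chars
```

With this definition, `cer(["猫仔"], ["猫囝"])` counts one substitution over two characters, while passing `unify={"囝": "仔"}` treats the two spellings as identical, which illustrates why unifying labels lowers the reported error.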

📄 License

This project is licensed under the CC BY 4.0 license.
