wav2vec2-large-xls-r-300m-cantonese: An Open-Source Model for Cantonese Speech Recognition

Wav2vec2 Large Xls R 300m Cantonese

Developed by ivanlau

This is an automatic speech recognition (ASR) model fine-tuned on Cantonese (Hong Kong) datasets based on the facebook/wav2vec2-xls-r-300m model, specifically designed for Cantonese speech recognition tasks.

Speech Recognition

Transformers

Language: Chinese
Open Source License: Apache-2.0
Tags: #Cantonese speech recognition

Downloads 42

Release Time: 3/2/2022

Model Overview

This model is a fine-tuned version of facebook/wav2vec2-xls-r-300m on the MOZILLA-FOUNDATION/COMMON_VOICE_8_0 - ZH-HK dataset, primarily used for Cantonese (Hong Kong) speech recognition tasks.

Model Features

Cantonese speech recognition

Speech recognition capability specifically optimized for Hong Kong Cantonese

Based on XLS-R architecture

Uses Facebook's wav2vec2-xls-r-300m, a multilingual self-supervised speech model, as the base for fine-tuning

Multi-dataset evaluation

Evaluated on the Common Voice 8 test set and the Robust Speech Event development data

Model Capabilities

Cantonese speech-to-text

Automatic speech recognition

Speech content transcription

Use Cases

Speech transcription

Cantonese speech content transcription

Convert Cantonese speech content into text

Achieved WER of 0.8111 and CER of 0.2196 on the Common Voice 8 test set

Voice assistant

Cantonese voice command recognition

Transcribe Cantonese voice commands to text so that a downstream component can act on them
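The gap between the reported WER (0.8111) and CER (0.2196) is easier to interpret with the metric definitions in hand. Written Cantonese has no spaces between words, so if WER is computed on whitespace-separated tokens, a whole sentence can count as a single "word" and one wrong character makes the entire token wrong. The card does not say how the metrics were computed; the sketch below assumes the standard edit-distance definitions, with WER on whitespace tokens and CER on characters. The sentences are made-up examples, not data from the evaluation.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref: str, hyp: str) -> float:
    """Word error rate on whitespace-separated tokens."""
    ref_tokens = ref.split()
    return edit_distance(ref_tokens, hyp.split()) / len(ref_tokens)


def cer(ref: str, hyp: str) -> float:
    """Character error rate, ignoring spaces."""
    ref_chars = list(ref.replace(" ", ""))
    return edit_distance(ref_chars, list(hyp.replace(" ", ""))) / len(ref_chars)


reference = "我哋今日去飲茶"   # made-up example: 7 characters, no spaces
hypothesis = "我地今日去飲茶"  # one character differs

print(wer(reference, hypothesis))  # 1.0: the only whitespace token is wrong
print(cer(reference, hypothesis))  # 1/7: one of seven characters is wrong
print(wer("好", "好 呀 喎"))        # 2.0: insertions can push WER above 1
```

Under these assumptions, CER is usually the more informative of the two figures for unsegmented Chinese text.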

🚀 XLS-R-300M - Chinese_HongKong (Cantonese)

This is a fine-tuned automatic speech recognition model based on the XLS-R-300M architecture. It is trained on the MOZILLA-FOUNDATION/COMMON_VOICE_8_0 - ZH-HK dataset and targets speech recognition for Cantonese as spoken in Hong Kong.

🚀 Quick Start

This model is a fine-tuned version of [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) on the MOZILLA-FOUNDATION/COMMON_VOICE_8_0 - ZH-HK dataset. It achieves the following results on the evaluation set:

Loss: 1.4848
Wer: 0.8004
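
The card itself does not include an inference snippet. The following is a minimal sketch, assuming the `transformers` library with a PyTorch backend is installed, that the model ID is the one used in the evaluation commands in this card, and that `ffmpeg` is available so the pipeline can decode and resample audio files. The `transcribe` helper and the file name are illustrative and not part of the original card.

```python
MODEL_ID = "ivanlau/wav2vec2-large-xls-r-300m-cantonese"


def transcribe(audio_path: str, model_id: str = MODEL_ID) -> str:
    """Return the transcription of one audio file (hypothetical helper)."""
    # Imported lazily so this module loads even when transformers is absent.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model=model_id)
    # The ASR pipeline returns a dict whose "text" key holds the transcript.
    return asr(audio_path)["text"]


# Usage (placeholder path; the first call downloads the model weights):
# print(transcribe("sample.wav"))
```

Building the pipeline once and reusing it is preferable when transcribing many files; it is created inside the function here only to keep the sketch short.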

📚 Documentation

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 0.0003
train_batch_size: 32
eval_batch_size: 16
seed: 42
gradient_accumulation_steps: 2
total_train_batch_size: 64
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 500
num_epochs: 100.0
mixed_precision_training: Native AMP
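
To make these settings concrete, the sketch below computes the effective batch size and the learning rate at a few optimizer steps. It assumes single-device training (which is what 32 × 2 = 64 implies), the 18,300 total steps shown in the training results table, and the usual shape of a linear schedule: linear warmup from 0 to the peak rate, then linear decay to 0. The function name is illustrative.

```python
LEARNING_RATE = 3e-4
TRAIN_BATCH_SIZE = 32
GRAD_ACCUM_STEPS = 2
WARMUP_STEPS = 500
TOTAL_STEPS = 18_300  # 183 steps per epoch x 100 epochs, per the results table

# Samples contributing to each optimizer step, assuming one device.
effective_batch = TRAIN_BATCH_SIZE * GRAD_ACCUM_STEPS


def linear_lr(step: int) -> float:
    """Learning rate at an optimizer step: linear warmup, then linear decay."""
    if step < WARMUP_STEPS:
        return LEARNING_RATE * step / WARMUP_STEPS
    remaining = max(0, TOTAL_STEPS - step)
    return LEARNING_RATE * remaining / (TOTAL_STEPS - WARMUP_STEPS)


print(effective_batch)  # 64
for step in (0, 250, 500, 9_400, 18_300):
    print(step, linear_lr(step))
```

The peak rate is reached at step 500, a little under three epochs in, which lines up with the first logged drop in validation loss in the table.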

Training results

Training Loss	Epoch	Step	Validation Loss	Wer
No log	1.0	183	47.8442	1.0
No log	2.0	366	6.3109	1.0
41.8902	3.0	549	6.2392	1.0
41.8902	4.0	732	5.9739	1.1123
41.8902	5.0	915	4.9014	1.9474
5.5817	6.0	1098	3.9892	1.0188
5.5817	7.0	1281	3.5080	1.0104
5.5817	8.0	1464	3.0797	0.9905
3.5579	9.0	1647	2.8111	0.9836
3.5579	10.0	1830	2.6726	0.9815
2.7771	11.0	2013	2.7177	0.9809
2.7771	12.0	2196	2.3582	0.9692
2.7771	13.0	2379	2.1708	0.9757
2.3488	14.0	2562	2.0491	0.9526
2.3488	15.0	2745	1.8518	0.9378
2.3488	16.0	2928	1.6845	0.9286
1.7859	17.0	3111	1.6412	0.9280
1.7859	18.0	3294	1.5488	0.9035
1.7859	19.0	3477	1.4546	0.9010
1.3898	20.0	3660	1.5147	0.9201
1.3898	21.0	3843	1.4467	0.8959
1.1291	22.0	4026	1.4743	0.9035
1.1291	23.0	4209	1.3827	0.8762
1.1291	24.0	4392	1.3437	0.8792
0.8993	25.0	4575	1.2895	0.8577
0.8993	26.0	4758	1.2928	0.8558
0.8993	27.0	4941	1.2947	0.9163
0.6298	28.0	5124	1.3151	0.8738
0.6298	29.0	5307	1.2972	0.8514
0.6298	30.0	5490	1.3030	0.8432
0.4757	31.0	5673	1.3264	0.8364
0.4757	32.0	5856	1.3131	0.8421
0.3735	33.0	6039	1.3457	0.8588
0.3735	34.0	6222	1.3450	0.8473
0.3735	35.0	6405	1.3452	0.9218
0.3253	36.0	6588	1.3754	0.8397
0.3253	37.0	6771	1.3554	0.8353
0.3253	38.0	6954	1.3532	0.8312
0.2816	39.0	7137	1.3694	0.8345
0.2816	40.0	7320	1.3953	0.8296
0.2397	41.0	7503	1.3858	0.8293
0.2397	42.0	7686	1.3959	0.8402
0.2397	43.0	7869	1.4350	0.9318
0.2084	44.0	8052	1.4004	0.8806
0.2084	45.0	8235	1.3871	0.8255
0.2084	46.0	8418	1.4060	0.8252
0.1853	47.0	8601	1.3992	0.8501
0.1853	48.0	8784	1.4186	0.8252
0.1853	49.0	8967	1.4120	0.8165
0.1671	50.0	9150	1.4166	0.8214
0.1671	51.0	9333	1.4411	0.8501
0.1513	52.0	9516	1.4692	0.8394
0.1513	53.0	9699	1.4640	0.8391
0.1513	54.0	9882	1.4501	0.8419
0.133	55.0	10065	1.4134	0.8351
0.133	56.0	10248	1.4593	0.8405
0.133	57.0	10431	1.4560	0.8389
0.1198	58.0	10614	1.4734	0.8334
0.1198	59.0	10797	1.4649	0.8318
0.1198	60.0	10980	1.4659	0.8100
0.1109	61.0	11163	1.4784	0.8119
0.1109	62.0	11346	1.4938	0.8149
0.1063	63.0	11529	1.5050	0.8152
0.1063	64.0	11712	1.4773	0.8176
0.1063	65.0	11895	1.4836	0.8261
0.0966	66.0	12078	1.4979	0.8157
0.0966	67.0	12261	1.4603	0.8048
0.0966	68.0	12444	1.4803	0.8127
0.0867	69.0	12627	1.4974	0.8130
0.0867	70.0	12810	1.4721	0.8078
0.0867	71.0	12993	1.4644	0.8192
0.0827	72.0	13176	1.4835	0.8138
0.0827	73.0	13359	1.4934	0.8122
0.0734	74.0	13542	1.4951	0.8062
0.0734	75.0	13725	1.4908	0.8070
0.0734	76.0	13908	1.4876	0.8124
0.0664	77.0	14091	1.4934	0.8053
0.0664	78.0	14274	1.4603	0.8048
0.0664	79.0	14457	1.4732	0.8073
0.0602	80.0	14640	1.4925	0.8078
0.0602	81.0	14823	1.4812	0.8064
0.057	82.0	15006	1.4950	0.8013
0.057	83.0	15189	1.4785	0.8056
0.057	84.0	15372	1.4856	0.7993
0.0517	85.0	15555	1.4755	0.8034
0.0517	86.0	15738	1.4813	0.8034
0.0517	87.0	15921	1.4966	0.8048
0.0468	88.0	16104	1.4883	0.8002
0.0468	89.0	16287	1.4746	0.8023
0.0468	90.0	16470	1.4697	0.7974
0.0426	91.0	16653	1.4775	0.8004
0.0426	92.0	16836	1.4852	0.8023
0.0387	93.0	17019	1.4868	0.8004
0.0387	94.0	17202	1.4785	0.8021
0.0387	95.0	17385	1.4892	0.8015
0.0359	96.0	17568	1.4862	0.8018
0.0359	97.0	17751	1.4851	0.8007
0.0359	98.0	17934	1.4846	0.7999
0.0347	99.0	18117	1.4852	0.7993
0.0347	100.0	18300	1.4848	0.8004
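
One pattern in the table is worth noting: validation loss reaches its minimum early (1.2895 at epoch 25) and then drifts upward while training loss keeps falling, yet WER continues to improve slowly and is lowest (0.7974) at epoch 90 rather than at the final epoch. A minimal sketch of selecting a checkpoint by each criterion, using a handful of rows copied from the table above (the tuple layout and variable names are illustrative):

```python
# (epoch, validation_loss, wer) rows copied from the table; not the full table.
rows = [
    (10, 2.6726, 0.9815),
    (25, 1.2895, 0.8577),
    (50, 1.4166, 0.8214),
    (84, 1.4856, 0.7993),
    (90, 1.4697, 0.7974),
    (100, 1.4848, 0.8004),
]

best_by_loss = min(rows, key=lambda r: r[1])
best_by_wer = min(rows, key=lambda r: r[2])

print("lowest validation loss:", best_by_loss)  # (25, 1.2895, 0.8577)
print("lowest WER:", best_by_wer)               # (90, 1.4697, 0.7974)
```

The two criteria disagree by 65 epochs, so the choice of selection metric matters here. The card does not say which checkpoint was published, though the reported evaluation figures (loss 1.4848, WER 0.8004) match the final epoch.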

Evaluation Commands

To evaluate on mozilla-foundation/common_voice_8_0 with split test

python eval.py --model_id ivanlau/wav2vec2-large-xls-r-300m-cantonese --dataset mozilla-foundation/common_voice_8_0 --config zh-HK --split test --log_outputs

To evaluate on speech-recognition-community-v2/dev_data

python eval.py --model_id ivanlau/wav2vec2-large-xls-r-300m-cantonese --dataset speech-recognition-community-v2/dev_data --config zh-HK --split validation --chunk_length_s 5.0 --stride_length_s 1.0 --log_outputs
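
The second command adds `--chunk_length_s 5.0` and `--stride_length_s 1.0`, which presumably make the script transcribe long recordings in overlapping windows: each window is 5 seconds long, and 1 second on each inner edge serves only as context, with its predictions discarded when the pieces are stitched together. The sketch below is a simplified illustration of that windowing, assuming the stride applies equally to the left and right of each chunk. It is not the pipeline's actual code, and `chunk_windows` is a hypothetical helper.

```python
def chunk_windows(total_s: float, chunk_s: float = 5.0, stride_s: float = 1.0):
    """Yield (start, end, keep_start, keep_end) in seconds for each chunk.

    start/end bound the audio fed to the model; keep_start/keep_end bound the
    part whose predictions are kept after dropping the overlapping strides.
    """
    step = chunk_s - 2 * stride_s  # advance by the non-overlapping part
    if step <= 0:
        raise ValueError("chunk length must exceed twice the stride")
    start = 0.0
    while True:
        end = min(start + chunk_s, total_s)
        is_first = start == 0.0
        is_last = start + chunk_s >= total_s
        # The outer edges of the recording have no neighbour, so keep them whole.
        keep_start = start if is_first else start + stride_s
        keep_end = end if is_last else end - stride_s
        yield (start, end, keep_start, keep_end)
        if is_last:
            break
        start += step


for window in chunk_windows(12.0):
    print(window)
```

For a 12-second clip this yields four windows whose kept regions (0 to 4, 4 to 7, 7 to 10, 10 to 12 seconds) tile the recording without gaps or overlap.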

Framework versions

Transformers 4.17.0.dev0
PyTorch 1.10.2+cu102
Datasets 1.18.3
Tokenizers 0.11.0

📄 License

This project is licensed under the Apache 2.0 license.
