Chunkformer-large-vie: Open-source Vietnamese Speech Recognition Model Fine-tuned on Approximately 3000 Hours of Speech Data

Chunkformer Large Vie

Developed by khanhld

A large-scale Vietnamese automatic speech recognition model based on the ChunkFormer architecture, fine-tuned on approximately 3000 hours of publicly available Vietnamese speech data, with excellent performance.

Speech Recognition

PyTorch

Other #Vietnamese speech recognition #Long audio processing #Low word error rate

Downloads 1,765

Release Time : 2/1/2025

Model Overview

ChunkFormer-Large-Vie is an automatic speech recognition model specifically optimized for Vietnamese, using the ChunkFormer architecture, achieving leading performance on multiple public datasets.

Model Features

High-performance Vietnamese recognition

Achieves state-of-the-art results on the Common Voice Vi and VIVOS datasets, with word error rates (WER) of 6.66 and 4.18, respectively.

Long audio processing capability

Supports transcription of long audio, optimizing memory usage and computational efficiency through chunk processing technology.

Multi-dataset training

Trained on approximately 3000 hours of diverse Vietnamese speech data, covering various scenarios and accents.

Model Capabilities

Vietnamese speech recognition

Long audio transcription

Real-time speech-to-text

Use Cases

Speech transcription

Meeting minutes

Automatically transcribe Vietnamese meeting recordings into text records

Highly accurate transcription results

Voice assistant

Provide speech recognition capabilities for Vietnamese voice assistants

Low-latency, high-accuracy recognition

Education

Language learning

Help learners practice Vietnamese pronunciation and listening

Provide accurate pronunciation evaluation

🚀 ChunkFormer-Large-Vie: Large-Scale Pretrained ChunkFormer for Vietnamese Automatic Speech Recognition

ChunkFormer-Large-Vie is a large-scale Vietnamese Automatic Speech Recognition (ASR) model based on the ChunkFormer architecture, offering high-performance speech recognition for Vietnamese.

🚀 Quick Start

To use the ChunkFormer model for Vietnamese Automatic Speech Recognition, follow these steps:

Download the ChunkFormer Repository

git clone https://github.com/khanld/chunkformer.git
cd chunkformer
pip install -r requirements.txt

Download the Model Checkpoint from Hugging Face

pip install huggingface_hub
huggingface-cli download khanhld/chunkformer-large-vie --local-dir "./chunkformer-large-vie"

Alternatively, clone the model repository with Git LFS:

git lfs install
git clone https://huggingface.co/khanhld/chunkformer-large-vie

Either command downloads the model checkpoint to a chunkformer-large-vie folder inside your chunkformer directory.
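The same download can also be scripted from Python. The sketch below assumes the huggingface_hub package installed above and uses its snapshot_download function. The helper name download_checkpoint and the two constants are illustrative and not part of the ChunkFormer repository.

```python
from pathlib import Path

# Repository ID and target folder copied from the CLI command in this guide.
REPO_ID = "khanhld/chunkformer-large-vie"
LOCAL_DIR = Path("./chunkformer-large-vie")


def download_checkpoint(repo_id: str = REPO_ID, local_dir: Path = LOCAL_DIR) -> Path:
    """Download all files of the model repository into local_dir.

    Hypothetical helper: a thin wrapper around huggingface_hub.snapshot_download.
    """
    # Imported lazily so the module can be loaded before huggingface_hub is installed.
    from huggingface_hub import snapshot_download

    return Path(snapshot_download(repo_id=repo_id, local_dir=str(local_dir)))
```

Calling download_checkpoint() requires network access and returns the local folder, which can then be passed to --model_checkpoint.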

Run the model

# --total_batch_duration is in seconds (default: 1800)
python decode.py \
    --model_checkpoint path/to/local/chunkformer-large-vie \
    --long_form_audio path/to/audio.wav \
    --total_batch_duration 14400 \
    --chunk_size 64 \
    --left_context_size 128 \
    --right_context_size 128

Example Output:

[00:00:01.200] - [00:00:02.400]: this is a transcription example
[00:00:02.500] - [00:00:03.700]: testing the long-form audio
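If the transcript is needed in structured form, lines in the format shown above can be parsed with a short script. This is a minimal sketch that assumes the output always follows the "[start] - [end]: text" pattern of the example. The names Segment and parse_transcript are illustrative and not part of the ChunkFormer repository.

```python
import re
from typing import List, NamedTuple


class Segment(NamedTuple):
    start: float  # segment start, in seconds
    end: float    # segment end, in seconds
    text: str


# One timestamp such as [00:00:01.200]; the fractional part is optional.
_TS = r"\[(\d+):(\d{2}):(\d{2}(?:\.\d+)?)\]"
_LINE = re.compile(rf"^{_TS} - {_TS}:\s*(.*)$")


def _seconds(hours: str, minutes: str, seconds: str) -> float:
    """Convert the hour, minute, and second fields of a timestamp to seconds."""
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def parse_transcript(output: str) -> List[Segment]:
    """Parse transcript lines into segments, skipping lines that do not match."""
    segments = []
    for line in output.splitlines():
        match = _LINE.match(line.strip())
        if match is None:
            continue
        h1, m1, s1, h2, m2, s2, text = match.groups()
        segments.append(Segment(_seconds(h1, m1, s1), _seconds(h2, m2, s2), text.strip()))
    return segments
```

Each returned Segment carries the start and end time in seconds, which makes it straightforward to convert the transcript to subtitle formats or to align it with other annotations.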

Advanced Usage can be found HERE

✨ Features

High-performance ASR: Based on the ChunkFormer architecture, it provides accurate speech recognition for Vietnamese.
Large-scale training: Fine-tuned on approximately 3000 hours of public Vietnamese speech data from diverse datasets.

📦 Installation

git clone https://github.com/khanld/chunkformer.git
cd chunkformer
pip install -r requirements.txt

pip install huggingface_hub
huggingface-cli download khanhld/chunkformer-large-vie --local-dir "./chunkformer-large-vie"

Alternatively, clone the model repository with Git LFS:

git lfs install
git clone https://huggingface.co/khanhld/chunkformer-large-vie

💻 Usage Examples

Basic Usage

# --total_batch_duration is in seconds (default: 1800)
python decode.py \
    --model_checkpoint path/to/local/chunkformer-large-vie \
    --long_form_audio path/to/audio.wav \
    --total_batch_duration 14400 \
    --chunk_size 64 \
    --left_context_size 128 \
    --right_context_size 128
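The --chunk_size, --left_context_size, and --right_context_size flags indicate that the audio is processed in fixed-size chunks, each with some surrounding context. The sketch below is only an illustration of that idea, not the ChunkFormer implementation: it lists, for a sequence of frames, which frames each chunk covers and which neighbouring frames it may look at. The name chunk_windows and the assumption that all three sizes are counted in the same frame units are hypothetical.

```python
from typing import List, NamedTuple


class ChunkWindow(NamedTuple):
    context_start: int  # first frame visible to the chunk (inclusive)
    chunk_start: int    # first frame the chunk is responsible for (inclusive)
    chunk_end: int      # end of the chunk (exclusive)
    context_end: int    # end of the visible context (exclusive)


def chunk_windows(
    num_frames: int,
    chunk_size: int = 64,
    left_context: int = 128,
    right_context: int = 128,
) -> List[ChunkWindow]:
    """Split num_frames into consecutive chunks with clipped left/right context."""
    if num_frames < 0 or chunk_size <= 0 or left_context < 0 or right_context < 0:
        raise ValueError("sizes must be non-negative and chunk_size must be positive")
    windows = []
    for start in range(0, num_frames, chunk_size):
        end = min(start + chunk_size, num_frames)
        windows.append(
            ChunkWindow(
                context_start=max(0, start - left_context),
                chunk_start=start,
                chunk_end=end,
                context_end=min(num_frames, end + right_context),
            )
        )
    return windows
```

For example, chunk_windows(200) yields four chunks. Because each chunk sees only a bounded window instead of the whole recording, the work per chunk stays constant regardless of audio length, which is the intuition behind chunk-based long-form transcription.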

Advanced Usage

For advanced usage, please refer to HERE

📚 Documentation

The Documentation and Implementation of ChunkFormer are publicly available.

🔧 Technical Details

ChunkFormer-Large-Vie is a large-scale Vietnamese Automatic Speech Recognition (ASR) model based on the ChunkFormer architecture, introduced at ICASSP 2025. The model has been fine-tuned on approximately 3000 hours of public Vietnamese speech data sourced from diverse datasets. A list of datasets can be found HERE.

Note: only the train subset was used for tuning the model.

We evaluate the models using Word Error Rate (WER). To ensure consistency and fairness in comparison, we manually apply Text Normalization, including the handling of numbers, uppercase letters, and punctuation.
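To make the metric concrete, the sketch below computes WER as the word-level edit distance divided by the number of reference words. It is a minimal illustration, not the evaluation script used for the benchmark: the normalize function only lowercases and strips ASCII punctuation, and does not cover the number handling mentioned above. Both function names are illustrative.

```python
import string

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> str:
    """Simplified normalization: lowercase, drop ASCII punctuation, squeeze spaces."""
    return " ".join(text.lower().translate(_PUNCT_TABLE).split())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / number of reference words."""
    ref = normalize(reference).split()
    hyp = normalize(hypothesis).split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Row-by-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)
```

The function returns a fraction, so a value of 0.5 means half of the reference words were wrong. Multiply by 100 to express it as a percentage.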

📄 License

This project is licensed under the CC BY-NC 4.0 license.

📊 Benchmark Results

Public Models

No.	Model	#Params	Vivos	Common Voice	VLSP-Task 1	Avg.
1	ChunkFormer	110M	4.18	6.66	14.09	8.31
2	vinai/PhoWhisper-large	1.55B	4.67	8.14	13.75	8.85
3	nguyenvulebinh/wav2vec2-base-vietnamese-250h	95M	10.77	18.34	13.33	14.15
4	openai/whisper-large-v3	1.55B	8.81	15.45	20.41	14.89
5	khanhld/wav2vec2-base-vietnamese-160h	95M	15.05	10.78	31.62	19.16
6	homebrewltd/Ichigo-whisper-v0.1	22M	13.46	23.52	21.64	19.54

Private Models (API)

No.	Model	VLSP-Task 1
1	ChunkFormer	14.1
2	Viettel	14.5
3	Google	19.5
4	FPT	28.8

📝 Citation

If you use this work in your research, please cite:

@INPROCEEDINGS{10888640,
  author={Le, Khanh and Ho, Tuan Vu and Tran, Dung and Chau, Duc Thanh},
  booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, 
  title={ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription}, 
  year={2025},
  volume={},
  number={},
  pages={1-5},
  keywords={Scalability;Memory management;Graphics processing units;Signal processing;Performance gain;Hardware;Resource management;Speech processing;Standards;Context modeling;chunkformer;masked batch;long-form transcription},
  doi={10.1109/ICASSP49660.2025.10888640}
}

📞 Contact

khanhld218@gmail.com

📋 Information Table

Property	Details
Model Type	Vietnamese Automatic Speech Recognition (ASR) model based on the ChunkFormer architecture
Training Data	Approximately 3000 hours of public Vietnamese speech data from diverse datasets, list available HERE
Metrics	Word Error Rate (WER)
Pipeline Tag	automatic-speech-recognition
Tags	transcription, audio, speech, chunkformer, asr, automatic-speech-recognition
License	cc-by-nc-4.0
