chunkformer-large-vie: open-source Vietnamese speech recognition model fine-tuned on about 3,000 hours of speech data

Chunkformer Large Vie

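Developed by khanhld

A large-scale Vietnamese automatic speech recognition model based on the ChunkFormer architecture, fine-tuned on about 3,000 hours of public Vietnamese speech data.

Speech recognition

PyTorch

Other #Vietnamese speech recognition #long-form audio processing #low word error rate

Downloads: 1,765

Released: 2/1/2025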

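Model overview

ChunkFormer-Large-Vie is an automatic speech recognition model for Vietnamese built on the ChunkFormer architecture. On VIVOS and Common Voice it reports the lowest word error rate (WER) among the public models in the benchmark.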

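Model features

High-accuracy Vietnamese recognition

Reaches a WER of 6.66 on Common Voice Vi and 4.18 on VIVOS, the best results among the public models in the benchmark table.

Long-form audio processing

Supports long-form transcription; audio is processed in chunks to reduce memory use and computation.

Multi-dataset training

Fine-tuned on about 3,000 hours of Vietnamese speech drawn from several public datasets, covering a range of scenarios and accents.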

Model capabilities

Vietnamese speech recognition

Long-form audio transcription

Real-time speech-to-text

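Use cases

Speech transcription

Meeting minutes

Automatically transcribe recordings of Vietnamese meetings into written records

High-accuracy transcripts

Voice assistants

Provide speech recognition for Vietnamese voice assistants

Low-latency, high-accuracy recognition

Education

Language learning

Help learners practice Vietnamese pronunciation and listening

Accurate pronunciation assessment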

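🚀 ChunkFormer-Large-Vie: Large-Scale Pretrained ChunkFormer for Vietnamese Automatic Speech Recognition

ChunkFormer-Large-Vie is a large-scale Vietnamese automatic speech recognition (ASR) model based on the ChunkFormer architecture, which was introduced at ICASSP 2025. It targets both accuracy and efficiency in Vietnamese speech recognition, including long-form transcription.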

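🚀 Quick start

To run Vietnamese automatic speech recognition with the ChunkFormer model, follow these steps:

1. Download the ChunkFormer repository and install its dependencies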

git clone https://github.com/khanld/chunkformer.git
cd chunkformer
pip install -r requirements.txt

2. Download the model checkpoint from Hugging Face

pip install huggingface_hub
huggingface-cli download khanhld/chunkformer-large-vie --local-dir "./chunkformer-large-vie"

or

git lfs install
git clone https://huggingface.co/khanhld/chunkformer-large-vie

Either command downloads the model checkpoint into a chunkformer-large-vie folder in the current directory.
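The download can also be scripted from Python. The following is a minimal sketch, assuming the huggingface_hub package from step 2 is installed; snapshot_download is that library's function for fetching a whole repository, while REPO_ID and download_checkpoint are names chosen here purely for illustration.

```python
from pathlib import Path

# Repository ID taken from the CLI command above.
REPO_ID = "khanhld/chunkformer-large-vie"


def download_checkpoint(local_dir: str = "./chunkformer-large-vie") -> Path:
    """Download every file of the model repository into local_dir.

    Hypothetical helper: it only wraps huggingface_hub.snapshot_download.
    """
    # Imported lazily so this module loads even before huggingface_hub is installed.
    from huggingface_hub import snapshot_download

    return Path(snapshot_download(repo_id=REPO_ID, local_dir=local_dir))
```

Calling download_checkpoint() returns the local directory, which can then be passed to --model_checkpoint in the next step.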

3. Run the model

# --total_batch_duration is given in seconds (default: 1800)
python decode.py \
    --model_checkpoint path/to/local/chunkformer-large-vie \
    --long_form_audio path/to/audio.wav \
    --total_batch_duration 14400 \
    --chunk_size 64 \
    --left_context_size 128 \
    --right_context_size 128

Example output:

[00:00:01.200] - [00:00:02.400]: this is a transcription example
[00:00:02.500] - [00:00:03.700]: testing the long-form audio
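For downstream processing, lines in this format can be turned into timed segments. The sketch below assumes every transcript line looks exactly like the two sample lines above; Segment, LINE_RE, to_seconds and parse_transcript are illustrative names and are not part of the ChunkFormer repository.

```python
import re
from typing import NamedTuple


class Segment(NamedTuple):
    start: float  # seconds from the beginning of the audio
    end: float    # seconds from the beginning of the audio
    text: str


# Matches lines such as "[00:00:01.200] - [00:00:02.400]: some text".
LINE_RE = re.compile(
    r"\[(\d+):(\d{2}):(\d{2}(?:\.\d+)?)\] - "
    r"\[(\d+):(\d{2}):(\d{2}(?:\.\d+)?)\]: (.*)"
)


def to_seconds(hours: str, minutes: str, seconds: str) -> float:
    """Convert the pieces of an HH:MM:SS.mmm timestamp to seconds."""
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def parse_transcript(output: str) -> list[Segment]:
    """Parse decoder output into segments, skipping lines that do not match."""
    segments = []
    for line in output.splitlines():
        match = LINE_RE.fullmatch(line.strip())
        if match is None:
            continue  # not a transcript line (for example a log message)
        h1, m1, s1, h2, m2, s2, text = match.groups()
        segments.append(
            Segment(to_seconds(h1, m1, s1), to_seconds(h2, m2, s2), text)
        )
    return segments


example = """\
[00:00:01.200] - [00:00:02.400]: this is a transcription example
[00:00:02.500] - [00:00:03.700]: testing the long-form audio
"""
segments = parse_transcript(example)
```

Each Segment carries start and end times in seconds, which makes it straightforward to export subtitles or align the text with other annotations.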

Advanced usage is documented in the ChunkFormer repository (https://github.com/khanld/chunkformer).
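The --chunk_size, --left_context_size and --right_context_size flags in the decode command suggest that audio is processed chunk by chunk, with each chunk seeing a bounded amount of neighbouring context. The sketch below illustrates that general idea only; it is not the ChunkFormer implementation, it assumes all sizes are counted in frames, and chunk_windows is a hypothetical helper.

```python
def chunk_windows(num_frames, chunk_size, left_context, right_context):
    """Yield (context_start, chunk_start, chunk_end, context_end) for each chunk.

    Frames in [chunk_start, chunk_end) belong to the chunk itself; frames in
    [context_start, context_end) are what the chunk is allowed to look at.
    All ranges are clipped to the sequence boundaries.
    """
    for chunk_start in range(0, num_frames, chunk_size):
        chunk_end = min(chunk_start + chunk_size, num_frames)
        context_start = max(0, chunk_start - left_context)
        context_end = min(num_frames, chunk_end + right_context)
        yield context_start, chunk_start, chunk_end, context_end


# A 200-frame sequence with the sizes used in the command above.
windows = list(
    chunk_windows(num_frames=200, chunk_size=64, left_context=128, right_context=128)
)
for window in windows:
    print(window)
```

Because each window has a bounded width regardless of how long the recording is, the work per chunk stays roughly constant, which is the intuition behind processing long audio in chunks.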

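✨ Key features

ChunkFormer architecture: ChunkFormer-Large-Vie is built on the ChunkFormer architecture, which was introduced at ICASSP 2025.
Large-scale training data: the model was fine-tuned on about 3,000 hours of public Vietnamese speech data drawn from several datasets.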

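📚 Documentation

The ChunkFormer documentation and implementation are publicly available in the repository used in the quick start (https://github.com/khanld/chunkformer).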

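🔧 Technical details

Model description

ChunkFormer-Large-Vie is a large-scale Vietnamese automatic speech recognition (ASR) model based on the ChunkFormer architecture, which was introduced at ICASSP 2025. It was fine-tuned on about 3,000 hours of public Vietnamese speech data drawn from several datasets. The list of datasets is linked from the model card on Hugging Face (https://huggingface.co/khanhld/chunkformer-large-vie).

Note: only the [train-subset] was used to tune the model.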

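Benchmark results

We evaluate the models using word error rate (WER); lower is better. To keep the comparison consistent and fair, we manually applied text normalization before scoring, covering numbers, uppercase letters, and punctuation.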
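As a reminder of how the metric works, WER is the word-level edit distance (substitutions, deletions and insertions) between hypothesis and reference, divided by the number of reference words. The sketch below is a simplified stand-in, not the evaluation script behind the tables: normalize only lowercases and strips punctuation (it does not handle numbers), and both function names are illustrative.

```python
import re
import unicodedata


def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation, and split into words."""
    # NFC composes Vietnamese letters and their diacritics into single
    # code points, which the Unicode-aware \w then treats as word characters.
    text = unicodedata.normalize("NFC", text).lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return text.split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the distance between the reference words seen so far
    # and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One deleted word and one substituted word out of five reference words.
wer = word_error_rate("Tôi đang học tiếng Việt.", "tôi học tiếng anh")
```

The function returns a fraction; multiply by 100 to compare with figures reported as percentages.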

Public models

No.	Model	Parameters	VIVOS	Common Voice	VLSP - Task 1	Average
1	ChunkFormer	110M	4.18	6.66	14.09	8.31
2	vinai/PhoWhisper-large	1.55B	4.67	8.14	13.75	8.85
3	nguyenvulebinh/wav2vec2-base-vietnamese-250h	95M	10.77	18.34	13.33	14.15
4	openai/whisper-large-v3	1.55B	8.81	15.45	20.41	14.89
5	khanhld/wav2vec2-base-vietnamese-160h	95M	15.05	10.78	31.62	19.16
6	homebrewltd/Ichigo-whisper-v0.1	22M	13.46	23.52	21.64	19.54
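The last column of the table appears to be the unweighted mean of the three per-dataset WERs. A quick check on the ChunkFormer row, with values copied from the table (the variable names are illustrative):

```python
# WER values copied from the ChunkFormer row of the table above.
chunkformer_wer = {"VIVOS": 4.18, "Common Voice": 6.66, "VLSP - Task 1": 14.09}

# Unweighted mean over the three test sets.
average = sum(chunkformer_wer.values()) / len(chunkformer_wer)
print(f"{average:.2f}")  # prints 8.31
```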

Private models (API)

No.	Model	VLSP - Task 1
1	ChunkFormer	14.1
2	Viettel	14.5
3	Google	19.5
4	FPT	28.8

📄 License

This model is released under the CC BY-NC 4.0 license.

📖 Citation

If you use this work in your research, please cite:

@INPROCEEDINGS{10888640,
  author={Le, Khanh and Ho, Tuan Vu and Tran, Dung and Chau, Duc Thanh},
  booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, 
  title={ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription}, 
  year={2025},
  volume={},
  number={},
  pages={1-5},
  keywords={Scalability;Memory management;Graphics processing units;Signal processing;Performance gain;Hardware;Resource management;Speech processing;Standards;Context modeling;chunkformer;masked batch;long-form transcription},
  doi={10.1109/ICASSP49660.2025.10888640}
}