Vietnamese end-to-end speech recognition using wav2vec 2.0
This project uses wav2vec 2.0 for end-to-end Vietnamese speech recognition, delivering strong results on multiple datasets.
Quick Start
When using the model, make sure that your speech input is sampled at 16kHz and that the audio is shorter than 10 seconds. You can follow the Colab link below to use a combination of the CTC-wav2vec model and a 4-gram LM.
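If your audio is sampled at a different rate, resample it to 16kHz before running the model. Below is a minimal sketch using torchaudio (an assumption: torchaudio is not required by this project, and my_audio.wav is a placeholder path):

import torchaudio

# Load the waveform and its native sample rate (placeholder path)
waveform, sr = torchaudio.load("my_audio.wav")

# Resample to the 16kHz rate the model expects
if sr != 16_000:
    waveform = torchaudio.transforms.Resample(orig_freq=sr, new_freq=16_000)(waveform)

# The project recommends clips shorter than 10 seconds
assert waveform.shape[1] < 10 * 16_000, "clip longer than 10s"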

⨠Features
- Powerful pre-training: the model is pre-trained on 13k hours of unlabeled Vietnamese YouTube audio.
- Fine-tuned on quality data: it is fine-tuned on 250 hours of the labeled VLSP ASR dataset, using 16kHz sampled speech audio.
- Combined with a language model: a 4-gram model trained on 2GB of spoken text is provided to improve recognition accuracy.
Installation
The original document does not list dedicated installation steps. To run the usage example below, you need the transformers, soundfile, and torch Python packages (for example, installed via pip install transformers soundfile torch).
Usage Examples
Basic Usage
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import soundfile as sf
import torch

# Load the pre-trained processor and CTC model
processor = Wav2Vec2Processor.from_pretrained("nguyenvulebinh/wav2vec2-base-vietnamese-250h")
model = Wav2Vec2ForCTC.from_pretrained("nguyenvulebinh/wav2vec2-base-vietnamese-250h")

# Read a sound file into the batch dict (the model expects 16kHz audio)
def map_to_array(batch):
    speech, _ = sf.read(batch["file"])
    batch["speech"] = speech
    return batch

ds = map_to_array({
    "file": 'audio-test/t1_0001-00010.wav'
})

# Tokenize the raw waveform
input_values = processor(ds["speech"], sampling_rate=16_000, return_tensors="pt", padding="longest").input_values

# Retrieve logits and greedily decode them (argmax over the vocabulary)
with torch.no_grad():
    logits = model(input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
print(transcription)
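Note that sf.read returns the sample rate as its second value; the snippet discards it and assumes the file is already at 16kHz (see the resampling sketch in the Quick Start section). The argmax step performs plain greedy CTC decoding without a language model; decoding with the provided 4-gram LM is sketched in the Documentation section below.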
Documentation
Model description
Our models are pre-trained on 13k hours of unlabeled Vietnamese YouTube audio and fine-tuned on 250 hours of labeled data from the VLSP ASR dataset, using 16kHz sampled speech audio.
We use the wav2vec2 architecture for the pre-trained model, following the wav2vec 2.0 paper, which shows for the first time that learning powerful representations from speech audio alone, followed by fine-tuning on transcribed speech, can outperform the best semi-supervised methods while being conceptually simpler.
For the fine-tuning phase, wav2vec2 is fine-tuned using Connectionist Temporal Classification (CTC), an algorithm used to train neural networks for sequence-to-sequence problems, mainly in automatic speech recognition and handwriting recognition.
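To make the CTC decoding rule concrete, here is a minimal sketch of greedy CTC decoding (merge consecutive repeats, then drop the blank token). This is an illustration only; in practice the blank token id should be taken from the tokenizer, and blank_id=0 below is an assumption:

import torch

def ctc_greedy_decode(logits, blank_id=0):
    # Most likely token at every frame (batch size 1 assumed)
    ids = torch.argmax(logits, dim=-1)[0].tolist()
    decoded, prev = [], None
    for i in ids:
        # Merge consecutive repeats, then drop CTC blank tokens
        if i != prev and i != blank_id:
            decoded.append(i)
        prev = i
    return decoded

The resulting ids can be mapped back to characters with the processor's tokenizer, which is what processor.batch_decode does in the usage example above.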
| Property | Details |
|----------|---------|
| Model Type | wav2vec 2.0 acoustic model, fine-tuned with CTC |
| Training Data | Pre-trained on 13k hours of unlabeled Vietnamese YouTube audio; fine-tuned on 250 hours of the labeled VLSP ASR dataset |
In a typical ASR system, two components are required: an acoustic model and a language model. Here, the CTC-wav2vec fine-tuned model works as the acoustic model. For the language model, we provide a 4-gram model trained on 2GB of spoken text.
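As an illustration of how the two components can be combined, here is a sketch using the pyctcdecode library with a KenLM binary, reusing processor and logits from the usage example above. This is one common decoding setup, not necessarily the exact pipeline of the project's Colab notebook, and the LM file name vi_lm_4grams.bin is a placeholder:

from pyctcdecode import build_ctcdecoder

# Vocabulary labels must be ordered by token id
vocab_dict = processor.tokenizer.get_vocab()
labels = [tok for tok, _ in sorted(vocab_dict.items(), key=lambda kv: kv[1])]

# Beam-search decoder backed by the 4-gram KenLM model (placeholder path)
decoder = build_ctcdecoder(labels, kenlm_model_path="vi_lm_4grams.bin")

# Decode the per-frame logits from the acoustic model
text = decoder.decode(logits[0].cpu().numpy())
print(text)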
For details of the training and fine-tuning process, readers can follow the fairseq GitHub repository and the [Hugging Face blog](https://huggingface.co/blog/fine-tune-wav2vec2-english).
Benchmark WER results:
Technical Details
The model uses the wav2vec2 architecture for pre-training. During the fine-tuning phase, Connectionist Temporal Classification (CTC) is applied. The pre-training data consists of 13k hours of unlabeled Vietnamese YouTube audio, and the fine-tuning data is 250 hours of the labeled VLSP ASR dataset. A 4-gram language model trained on 2GB of spoken text is also used to enhance recognition performance.
License
The ASR model parameters are made available for non-commercial use only, under the terms of the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. You can find details at: https://creativecommons.org/licenses/by-nc/4.0/legalcode
Citation

@misc{Thai_Binh_Nguyen_wav2vec2_vi_2021,
  author = {Thai Binh Nguyen},
  doi = {10.5281/zenodo.5356039},
  month = {09},
  title = {{Vietnamese end-to-end speech recognition using wav2vec 2.0}},
  url = {https://github.com/vietai/ASR},
  year = {2021}
}
Important Note
Please CITE our repo when it is used to help produce published results or is incorporated into other software.
Contact
nguyenvulebinh@gmail.com / binh@vietai.org
