Wav2vec2-large-vi-vlsp2020 Open-source Vietnamese Speech Recognition Model - Free and Accurate Audio-to-Text Conversion

Wav2vec2 Large Vi Vlsp2020

Developed by nguyenvulebinh

Vietnamese automatic speech recognition model based on the wav2vec2 architecture, pre-trained on 13,000 hours of unlabeled YouTube audio and fine-tuned on 250 hours of labeled data.

Speech Recognition

Transformers


Release Time: 11/4/2022

Model Overview

This model is built for Vietnamese speech recognition: it takes audio sampled at 16 kHz and outputs transcribed text. It is available in base and large versions, and both can be combined with a language model to improve recognition accuracy.

Model Features

Large-scale Pre-training

Pre-trained with 13,000 hours of Vietnamese YouTube audio to learn rich speech feature representations

Domain Fine-tuning

Fine-tuned on 250 hours of labeled data from the VLSP ASR dataset to optimize Vietnamese recognition performance

Language Model Integration

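Supports integration with a 5-gram language model, which lowers the word error rate (WER) on the VLSP T1 test set from 6.90% to 5.32% for the large model and from 8.66% to 6.53% for the base model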

High Performance

Achieves a word error rate of 5.32% on the VLSP T1 test set (when using language model)

Model Capabilities

Vietnamese speech recognition

Audio transcription

Supports 16kHz sample rate audio processing
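
Since the model expects 16 kHz input, it can help to inspect a recording before transcribing it. The following is a minimal sketch using only Python's standard `wave` module; the helper names `describe_wav` and `needs_resampling` are hypothetical and not part of the model's code, and the sketch assumes an uncompressed PCM WAV file.

```python
import wave

EXPECTED_RATE = 16_000  # sample rate the model card says the model expects


def describe_wav(path):
    """Return (sample_rate, channels, duration_seconds) read from a WAV header."""
    with wave.open(path, "rb") as wav_file:
        rate = wav_file.getframerate()
        channels = wav_file.getnchannels()
        duration = wav_file.getnframes() / rate
    return rate, channels, duration


def needs_resampling(path, expected_rate=EXPECTED_RATE):
    """Return True when the file's sample rate differs from the expected rate."""
    rate, _, _ = describe_wav(path)
    return rate != expected_rate
```

If a file does need resampling, one option is `torchaudio.functional.resample(waveform, orig_freq, new_freq)` on the tensor returned by `torchaudio.load`.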

Use Cases

Speech Transcription

Vietnamese Meeting Minutes

Automatically transcribe Vietnamese meeting recordings into text records

Benchmark WER with the language model is 5.32% on the VLSP T1 test set (about 94.7% word accuracy); results on meeting audio may differ

Media Subtitle Generation

Automatically generate subtitles for Vietnamese video content

Voice Assistants

Vietnamese Voice Command Recognition

Used as the front-end speech recognition module for Vietnamese voice assistants

🚀 Vietnamese ASR Wav2Vec2 Models

These models are designed for Vietnamese automatic speech recognition, leveraging the wav2vec2 architecture and trained on extensive Vietnamese audio data.

🚀 Quick Start

You can quickly start using these models by referring to the usage examples below or by clicking the Colab link to run the code online.

✨ Features

Advanced Architecture: Uses the wav2vec2 architecture, which is effective for speech recognition tasks.
Extensive Training: Pre-trained on 13k hours of unlabeled Vietnamese YouTube audio and fine-tuned on 250 hours of labeled data from the VLSP ASR dataset.
Low WER: Achieves a low Word Error Rate (WER) on the VLSP T1 test set, both with and without a Language Model (LM).

📦 Installation

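# Requires PyTorch and torchaudio (both are imported by the usage example)
# In a notebook, prefix each command with "!"
pip install transformers==4.20.0
pip install https://github.com/kpu/kenlm/archive/master.zip
pip install pyctcdecode==0.4.0
pip install huggingface_hub==0.10.0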
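
The pinned versions above matter because the usage example imports `cached_path` and `hf_bucket_url` from `transformers.file_utils`, and later transformers releases appear to have dropped these helpers. As a rough sketch, the installed versions can be compared against the pins with the standard `importlib.metadata` module; the function name `find_version_mismatches` is hypothetical.

```python
from importlib import metadata

# Versions pinned in the installation commands above.
PINNED = {
    "transformers": "4.20.0",
    "pyctcdecode": "0.4.0",
    "huggingface_hub": "0.10.0",
}


def find_version_mismatches(pinned=PINNED, get_version=metadata.version):
    """Map package name -> (expected, installed or None) for every mismatch."""
    mismatches = {}
    for package, expected in pinned.items():
        try:
            installed = get_version(package)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[package] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for package, (expected, installed) in find_version_mismatches().items():
        print(f"{package}: expected {expected}, found {installed}")
```

An empty result means the environment matches the pins; any printed line points at a package worth reinstalling before running the example.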

💻 Usage Examples

Basic Usage

from transformers.file_utils import cached_path, hf_bucket_url
from importlib.machinery import SourceFileLoader
from transformers import Wav2Vec2ProcessorWithLM
from IPython.lib.display import Audio
import torchaudio
import torch

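# Load model & processor
# The repository ships its own Wav2Vec2ForCTC class in model_handling.py,
# so that file is downloaded and the class is loaded from it.
model_name = "nguyenvulebinh/wav2vec2-large-vi-vlsp2020"
model_handling_path = cached_path(hf_bucket_url(model_name, filename="model_handling.py"))
model = SourceFileLoader("model", model_handling_path).load_module().Wav2Vec2ForCTC.from_pretrained(model_name)
processor = Wav2Vec2ProcessorWithLM.from_pretrained(model_name)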

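# Load an example audio file (the model expects 16 kHz input)
audio, sample_rate = torchaudio.load(cached_path(hf_bucket_url(model_name, filename="t2_0000006682.wav")))
# Pass the file's actual sample rate so a mismatch is reported instead of silently ignored
input_data = processor.feature_extractor(audio[0], sampling_rate=sample_rate, return_tensors='pt')

# Infer (gradients are not needed for inference)
with torch.no_grad():
    output = model(**input_data)

# Output transcript without LM (greedy decoding: best token per frame)
print(processor.tokenizer.decode(output.logits.argmax(dim=-1)[0].cpu().numpy()))

# Output transcript with LM (beam search rescored by the 5-gram language model)
print(processor.decode(output.logits.cpu().numpy()[0], beam_width=100).text)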
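
The "without LM" line in the example takes the highest-scoring token for each audio frame and lets the tokenizer turn that sequence into text. For a CTC model this means merging repeated tokens and dropping the blank symbol. The sketch below illustrates that collapse step in plain Python. It is not the tokenizer's actual implementation, and the toy vocabulary and `blank_id=0` are assumptions made for illustration.

```python
def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Collapse per-frame argmax ids into a label sequence (greedy CTC)."""
    collapsed = []
    previous = None
    for token_id in frame_ids:
        # Emit a token only when it differs from the previous frame
        # and is not the blank symbol.
        if token_id != previous and token_id != blank_id:
            collapsed.append(token_id)
        previous = token_id
    return collapsed


# Hypothetical toy vocabulary; the real model ships its own tokenizer vocabulary.
TOY_VOCAB = {0: "<blank>", 1: "x", 2: "i", 3: "n"}

frames = [1, 1, 0, 2, 2, 2, 0, 3, 3, 0, 0, 3]
ids = ctc_greedy_collapse(frames)
print("".join(TOY_VOCAB[i] for i in ids))  # prints "xinn"
```

The blank between the two runs of `3` is what allows the same character to appear twice in a row in the output. The "with LM" line replaces this per-frame choice with a beam search over the full logits.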

📚 Documentation

Model description

Our models use the wav2vec2 architecture. They are pre-trained on 13k hours of unlabeled Vietnamese YouTube audio and fine-tuned on 250 hours of labeled data from the VLSP ASR dataset, using speech audio sampled at 16 kHz. You can find more description here

Benchmark WER results (%) on the VLSP T1 test set:

WER (%)            base model    large model
without LM         8.66          6.90
with 5-gram LM     6.53          5.32
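
For readers unfamiliar with the metric, WER is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The sketch below is a generic implementation, not the evaluation script used for the benchmark, and the example sentences are made up. It also computes the relative error reduction from the figures in the table.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1] / len(ref_words)


def relative_reduction(wer_without_lm, wer_with_lm):
    """Fraction of errors removed by adding the language model."""
    return (wer_without_lm - wer_with_lm) / wer_without_lm


# Toy sentences (hypothetical, not taken from the test set): one word is dropped.
print(word_error_rate("tôi đi học hôm nay", "tôi đi hôm nay"))  # 0.2

# Figures from the benchmark table.
print(round(relative_reduction(8.66, 6.53), 3))  # base model: 0.246
print(round(relative_reduction(6.90, 5.32), 3))  # large model: 0.229
```

By this calculation the 5-gram LM removes roughly a quarter of the word errors for both model sizes.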

Model Parameters License

The ASR model parameters are made available for non-commercial use only, under the terms of the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. You can find details at: https://creativecommons.org/licenses/by-nc/4.0/legalcode

Contact

nguyenvulebinh@gmail.com

📄 License

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. For more information, see https://creativecommons.org/licenses/by-nc/4.0/legalcode
