wav2vec2-base-vi-vlsp2020: Open-source Model for Vietnamese Automatic Speech Recognition

Developed by nguyenvulebinh

A Vietnamese automatic speech recognition model based on the wav2vec2 architecture, pre-trained on 13,000 hours of unlabeled YouTube audio and fine-tuned on 250 hours of labeled data.


Release Time: 11/4/2022

Model Overview

This model is specifically designed for Vietnamese automatic speech recognition (ASR) and supports decoding with a language model to improve accuracy.

Model Features

Large-scale Pre-training: Self-supervised pre-training on 13,000 hours of unlabeled Vietnamese YouTube audio.

High-precision Fine-tuning: Fine-tuned on 250 hours of labeled data from the VLSP ASR dataset.

Language Model Integration: Supports decoding with a 5-gram language model, which lowers the base model's WER on the VLSP T1 test set from 8.66% to 6.53%.
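
As a rough illustration of why language model decoding can help, the sketch below rescores two candidate transcripts by adding a weighted language model score and a per-word bonus to the acoustic score. This is a simplified picture of LM-assisted decoding, not the decoder shipped with the model. The bigram table, the scores, and the weights `alpha` and `beta` are all made-up values for illustration.

```python
import math

# Hypothetical bigram log-probabilities (natural log). A real system would
# query a trained n-gram model such as the 5-gram LM mentioned above.
TOY_BIGRAM_LOGP = {
    ("<s>", "xin"): math.log(0.5),
    ("xin", "chào"): math.log(0.6),
    ("xin", "trào"): math.log(0.01),
}
UNSEEN_LOGP = math.log(1e-4)  # assumption: flat penalty for unseen bigrams


def lm_logp(words):
    """Sum toy bigram log-probabilities over a word sequence."""
    history = ["<s>"] + list(words[:-1])
    return sum(TOY_BIGRAM_LOGP.get(pair, UNSEEN_LOGP) for pair in zip(history, words))


def fused_score(acoustic_logp, words, alpha=0.5, beta=1.0):
    """Acoustic score plus weighted LM score plus a per-word bonus."""
    return acoustic_logp + alpha * lm_logp(words) + beta * len(words)


# Two made-up hypotheses: the acoustically stronger one has an unlikely word pair.
hypotheses = [(-4.0, ["xin", "trào"]), (-4.3, ["xin", "chào"])]
best_acoustic = max(hypotheses, key=lambda h: h[0])
best_fused = max(hypotheses, key=lambda h: fused_score(*h))
print(" ".join(best_acoustic[1]))  # choice by acoustic score alone
print(" ".join(best_fused[1]))     # choice after adding the LM score
```

With these toy numbers the acoustic score alone prefers "xin trào", while the fused score prefers "xin chào", because the language model penalizes the unlikely word pair.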

Model Capabilities

Vietnamese speech recognition

Speech decoding with language model

Use Cases

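Speech Transcription

Vietnamese Speech to Text: Converts Vietnamese speech into text. On the VLSP T1 test set, this base model reaches a WER of 6.53% with the 5-gram language model (8.66% without). The large model reaches 5.32% with the language model.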

🚀 Vietnamese ASR Model

This project presents an automatic speech recognition (ASR) model for Vietnamese built on the wav2vec2 architecture. It is pre-trained on 13k hours of unlabeled Vietnamese YouTube audio and fine-tuned on 250 hours of labeled data from the VLSP ASR dataset.

🚀 Quick Start

The original model card provides a button that opens the model in Google Colab, which is the quickest way to try it.

✨ Features

Powerful Architecture: Utilizes the wav2vec2 architecture, which is effective for speech recognition tasks.
Large-scale Pre-training: Pre-trained on 13k hours of unlabeled Vietnamese YouTube audio, enabling the model to capture rich language features.
Fine-tuning: Fine-tuned on 250 hours of labeled data from the VLSP ASR dataset, improving the accuracy of speech recognition.

📦 Installation

To use this model, you need to install the following dependencies:

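# Requires PyTorch (and torchaudio, which the usage example uses to load audio).
# In a notebook cell, prefix each command with "!".
pip install transformers==4.20.0
pip install https://github.com/kpu/kenlm/archive/master.zip
pip install pyctcdecode==0.4.0
pip install huggingface_hub==0.10.0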
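
The version pins matter because the usage example imports helpers from `transformers.file_utils` that newer transformers releases may no longer expose. A minimal sketch for checking the installed versions against the pins, using only the standard library (the `check_pins` helper is hypothetical, not part of any of these packages):

```python
from importlib import metadata

# Pins copied from the install commands above.
PINNED = {"transformers": "4.20.0", "pyctcdecode": "0.4.0", "huggingface_hub": "0.10.0"}


def check_pins(pins=PINNED):
    """Map each package to 'ok', 'missing', or the mismatching installed version."""
    report = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if installed == wanted else installed
    return report


for package, status in check_pins().items():
    print(f"{package}: {status}")
```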

💻 Usage Examples

Basic Usage

from importlib.machinery import SourceFileLoader

import torch
import torchaudio
from transformers import Wav2Vec2ProcessorWithLM
# cached_path and hf_bucket_url are available in the pinned transformers==4.20.0
from transformers.file_utils import cached_path, hf_bucket_url

# Load model & processor; the model class is defined in model_handling.py in the model repo
model_name = "nguyenvulebinh/wav2vec2-base-vi-vlsp2020"
model_module = SourceFileLoader("model", cached_path(hf_bucket_url(model_name, filename="model_handling.py"))).load_module()
model = model_module.Wav2Vec2ForCTC.from_pretrained(model_name)
processor = Wav2Vec2ProcessorWithLM.from_pretrained(model_name)

# Load an example audio file (16 kHz) from the model repo
audio, sample_rate = torchaudio.load(cached_path(hf_bucket_url(model_name, filename="t2_0000006682.wav")))
input_data = processor.feature_extractor(audio[0], sampling_rate=sample_rate, return_tensors='pt')

# Infer (gradients are not needed at inference time)
with torch.no_grad():
    output = model(**input_data)

# Output transcript without LM (argmax over the vocabulary at each frame)
print(processor.tokenizer.decode(output.logits.argmax(dim=-1)[0].cpu().numpy()))

# Output transcript with LM (beam search)
print(processor.decode(output.logits.cpu().numpy()[0], beam_width=100).text)
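
The "without LM" line above takes the argmax over the vocabulary at each frame and lets the tokenizer turn the resulting ID sequence into text. For a CTC model this typically means collapsing consecutive repeated IDs and dropping the blank token. A minimal pure-Python sketch of that step follows. The vocabulary and blank ID are made up for illustration and are not the model's actual vocabulary.

```python
from itertools import groupby

# Hypothetical toy vocabulary; the real model ships its own vocabulary file.
TOY_VOCAB = {0: "<pad>", 1: "x", 2: "i", 3: "n", 4: " ", 5: "c", 6: "h", 7: "à", 8: "o"}
BLANK_ID = 0  # assumption: ID 0 plays the role of the CTC blank


def greedy_ctc_decode(frame_ids, vocab=TOY_VOCAB, blank_id=BLANK_ID):
    """Collapse consecutive repeats, then drop blanks and map IDs to characters."""
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    return "".join(vocab[token_id] for token_id in collapsed if token_id != blank_id)


# Per-frame argmax IDs for a short utterance (made-up values).
frames = [1, 1, 0, 2, 3, 3, 0, 4, 5, 6, 6, 7, 0, 8, 8]
print(greedy_ctc_decode(frames))  # xin chào
```

A blank between two identical IDs keeps them as separate characters, which is how CTC represents doubled letters.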

📚 Documentation

Model Description

Our models use the wav2vec2 architecture. They are pre-trained on 13k hours of unlabeled Vietnamese YouTube audio and fine-tuned on 250 hours of labeled data from the VLSP ASR dataset, using speech audio sampled at 16 kHz. A fuller description is linked from the original model card.
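
Since the model is trained on 16 kHz audio, input recorded at another rate generally needs resampling before inference. The sketch below only checks the rate of a PCM WAV file with the standard library. The helper names are hypothetical, and `make_silent_wav` exists purely to produce demo input.

```python
import io
import wave

EXPECTED_RATE = 16_000  # the model description states 16 kHz input


def wav_sample_rate(source) -> int:
    """Return the sample rate of a PCM WAV file (path string or binary file object)."""
    with wave.open(source, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(source, expected_rate: int = EXPECTED_RATE) -> bool:
    """True when the file's sample rate differs from the expected rate."""
    return wav_sample_rate(source) != expected_rate


def make_silent_wav(rate: int, seconds: float = 0.1) -> io.BytesIO:
    """Build an in-memory mono 16-bit WAV of silence, for demonstration only."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(rate)
        wav_file.writeframes(b"\x00\x00" * int(rate * seconds))
    buffer.seek(0)
    return buffer


print(needs_resampling(make_silent_wav(44_100)))  # True
print(needs_resampling(make_silent_wav(16_000)))  # False
```

When resampling is needed, one option in the torchaudio stack already used here is `torchaudio.functional.resample(waveform, orig_freq, new_freq)`.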

Property	Details
Model Type	Our ASR model has two versions: base and large.
Training Data	Pre-trained on 13k hours of unlabeled Vietnamese YouTube audio and fine-tuned on 250 hours of labeled data from the VLSP ASR dataset.

Benchmark WER Result on VLSP T1 Testset

Decoding	Base model WER (%)	Large model WER (%)
Without LM	8.66	6.90
With 5-gram LM	6.53	5.32
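
The figures above are word error rates (WER): the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The sketch below shows one common way to compute it, plus the relative improvement implied by the table. The example sentences are made up, and the authors' evaluation script may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # At the start of row i, previous[j] is the distance between ref[:i-1] and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Relative improvement, e.g. WER without LM versus with LM."""
    return (before - after) / before


# One substituted word out of five reference words.
print(word_error_rate("tôi đang học tiếng việt", "tôi đang học tiếng anh"))  # 0.2
# Relative WER reduction from the 5-gram LM, using the table's figures.
print(round(100 * relative_reduction(8.66, 6.53), 1))  # base model: 24.6
print(round(100 * relative_reduction(6.90, 5.32), 1))  # large model: 22.9
```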

Model Parameters License

The ASR model parameters are made available for non-commercial use only, under the terms of the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. You can find details at: https://creativecommons.org/licenses/by-nc/4.0/legalcode

Contact

If you have any questions, please contact us at nguyenvulebinh@gmail.com.

