Wav2vec2 Base Vietnamese 160h

Developed by khanhld

Vietnamese speech recognition model based on Wav2vec2, fine-tuned on 160 hours of Vietnamese speech data

Speech Recognition

Transformers

Release Time : 5/7/2022

Model Overview

This model is a Vietnamese automatic speech recognition (ASR) model based on the Wav2vec2 architecture, fine-tuned on approximately 160 hours of Vietnamese speech datasets, supporting Vietnamese speech-to-text tasks.

Model Features

Multi-dataset training

The model was trained on multiple Vietnamese speech datasets including VIVOS, COMMON VOICE, FOSD, and VLSP

No language model required

Reaches a WER of 10.78% on Common Voice and 15.05% on VIVOS without an integrated language model

Open-source implementation

Provides complete pre-training and fine-tuning code, supporting custom dataset training

Model Capabilities

Vietnamese speech recognition

Audio-to-text conversion

Speech transcription

Use Cases

Speech transcription

Vietnamese speech transcription

Convert Vietnamese speech content into text

Achieved a WER of 10.78% on the Common Voice Vietnamese test set

Voice assistants

Vietnamese voice command recognition

Used as the front-end speech recognition module for Vietnamese voice assistants

language: vi
datasets:
- vivos
- common_voice
- FOSD
- VLSP
metrics:
- wer
pipeline_tag: automatic-speech-recognition
tags:
- audio
- speech
- Transformer
- wav2vec2
- automatic-speech-recognition
- vietnamese
license: cc-by-nc-4.0
widget:
- example_title: common_voice_vi_30519758.mp3
  src: https://huggingface.co/khanhld/wav2vec2-base-vietnamese-160h/raw/main/examples/common_voice_vi_30519758.mp3
- example_title: VIVOSDEV15_020.wav
  src: https://huggingface.co/khanhld/wav2vec2-base-vietnamese-160h/raw/main/examples/VIVOSDEV15_020.wav
model-index:
- name: Wav2vec2 Base Vietnamese 160h
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: common-voice-vietnamese
      type: common_voice
      args: vi
    metrics:
    - name: Test WER
      type: wer
      value: 10.78
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: VIVOS
      type: vivos
      args: vi
    metrics:
    - name: Test WER
      type: wer
      value: 15.05

Vietnamese Speech Recognition using Wav2vec 2.0

Model Description
Implementation
Benchmark Result
Example Usage
Evaluation
Citation
Contact

Model Description

We fine-tuned a Wav2vec2 base model on about 160 hours of Vietnamese speech from several sources, including VIVOS, COMMON VOICE, FOSD, and VLSP 100h. We have not yet incorporated a language model into the ASR system, but the results are already promising.

Implementation

We also provide code for Pre-training and Fine-tuning the Wav2vec2 model. If you wish to train on your dataset, check it out here:

Benchmark WER Result

WER (%)      VIVOS        COMMON VOICE 8.0
without LM   15.05        10.78
with LM      in progress  in progress
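
For readers new to the metric: WER is the word-level edit distance (substitutions, deletions, and insertions) between the prediction and the reference, divided by the number of reference words. The following is a minimal, dependency-free sketch of that definition. The function name `word_error_rate` and the sample sentences are illustrative assumptions, not part of the released code; the figures in the table were produced with the `wer` metric shown in the Evaluation section.

```python
def word_error_rate(reference: str, prediction: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = prediction.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substituted word out of four reference words.
print(f"{100 * word_error_rate('tôi đang học bài', 'tôi đang học bàn'):.2f}%")  # 25.00%
```

Because the denominator is the reference length, a WER above 100% is possible when the prediction contains many inserted words.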

Example Usage

from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import librosa
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

processor = Wav2Vec2Processor.from_pretrained("khanhld/wav2vec2-base-vietnamese-160h")
model = Wav2Vec2ForCTC.from_pretrained("khanhld/wav2vec2-base-vietnamese-160h")
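model.to(device)
model.eval()  # inference mode: disables dropout

def transcribe(wav):
  # wav: 1-D float array sampled at 16 kHz
  input_values = processor(wav, sampling_rate=16000, return_tensors="pt").input_values
  with torch.no_grad():  # no gradients needed for inference
    logits = model(input_values.to(device)).logits
  pred_ids = torch.argmax(logits, dim=-1)  # greedy CTC decoding
  pred_transcript = processor.batch_decode(pred_ids)[0]
  return pred_transcript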


wav, _ = librosa.load('path/to/your/audio/file', sr = 16000)
print(f"transcript: {transcribe(wav)}")
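
The `torch.argmax` plus `processor.batch_decode` step above amounts to greedy CTC decoding: take the most likely token for each frame, merge consecutive repeats, then drop the blank token. Below is a minimal sketch of that collapse rule on its own, without the model. The toy vocabulary and the blank id `0` are assumptions for illustration; the real tokenizer defines its own ids and word delimiter.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_ids, id_to_char, blank_id=0):
    """Collapse per-frame argmax ids into text: merge repeats, then drop blanks."""
    merged = [token_id for token_id, _ in groupby(frame_ids)]
    return "".join(id_to_char[t] for t in merged if t != blank_id)


# Toy vocabulary (hypothetical): id 0 is the CTC blank.
vocab = {0: "", 1: "x", 2: "i", 3: "n", 4: " "}
frames = [1, 1, 0, 2, 2, 2, 0, 3, 0, 3, 4, 4]
print(repr(ctc_greedy_collapse(frames, vocab)))  # 'xinn '
```

Note that the blank between the two `3` frames is what lets a doubled character survive the merge step.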

Evaluation

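from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from datasets import load_dataset, load_metric, Audio
import torch
import re

# Note: load_metric is deprecated in newer `datasets` releases;
# there, use evaluate.load("wer") from the `evaluate` package instead.
wer = load_metric("wer")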
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# load processor and model
processor = Wav2Vec2Processor.from_pretrained("khanhld/wav2vec2-base-vietnamese-160h")
model = Wav2Vec2ForCTC.from_pretrained("khanhld/wav2vec2-base-vietnamese-160h")
model.to(device)
model.eval()

# Load dataset
test_dataset = load_dataset("mozilla-foundation/common_voice_8_0", "vi", split="test", use_auth_token="your_huggingface_auth_token")
test_dataset = test_dataset.cast_column("audio", Audio(sampling_rate=16000))
chars_to_ignore = r'[,?.!\-;:"“%\'�]' # ignore special characters

# preprocess data
def preprocess(batch):
  audio = batch["audio"]
  batch["input_values"] = audio["array"]
  batch["transcript"] = re.sub(chars_to_ignore, '', batch["sentence"]).lower()
  return batch

# run inference
def inference(batch):
  input_values = processor(batch["input_values"],
                           sampling_rate=16000,
                           return_tensors="pt").input_values
  with torch.no_grad():  # no gradients needed at evaluation time
    logits = model(input_values.to(device)).logits
  pred_ids = torch.argmax(logits, dim=-1)
  batch["pred_transcript"] = processor.batch_decode(pred_ids)
  return batch
  
test_dataset = test_dataset.map(preprocess)
result = test_dataset.map(inference, batched=True, batch_size=1)
print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_transcript"], references=result["transcript"])))

Test Result: 10.78%
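
The `preprocess` step in the script above affects the reported score: references are stripped of punctuation and lowercased before scoring, so the model is not penalized for punctuation it never emits. Here is a small sketch of that normalization in isolation. The character class is a simplified subset of `chars_to_ignore`, and the helper name and sample sentence are made up for illustration.

```python
import re

# Simplified subset of the punctuation pattern used in the evaluation script.
CHARS_TO_IGNORE = re.compile(r'[,?.!\-;:"“%\']')


def normalize_transcript(sentence: str) -> str:
    """Remove punctuation and lowercase, mirroring the reference clean-up before WER."""
    return CHARS_TO_IGNORE.sub("", sentence).lower()


print(normalize_transcript("Xin chào, bạn khỏe không?"))  # xin chào bạn khỏe không
```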

Citation

BibTeX

@misc{Duy_Khanh_Finetune_Wav2vec_2_0_2022,
  author = {Duy Khanh, Le},
  doi = {10.5281/zenodo.6542357},
  license = {CC-BY-NC-4.0},
  month = {5},
  title = {{Finetune Wav2vec 2.0 For Vietnamese Speech Recognition}},
  url = {https://github.com/khanld/ASR-Wa2vec-Finetune},
  year = {2022}
}

APA

Duy Khanh, L. (2022). Finetune Wav2vec 2.0 For Vietnamese Speech Recognition [Data set]. https://doi.org/10.5281/zenodo.6542357

Contact

khanhld218@gmail.com
