wav2vec2-base-vietnamese: An Open-Source Vietnamese Speech Recognition Model

Wav2vec2 Base Vietnamese

Developed by dragonSwing

Vietnamese speech recognition model based on the Wav2Vec2 architecture and fine-tuned on the VLSP dataset. It expects speech input sampled at 16 kHz.

Speech Recognition

Transformers

Open Source License: Apache-2.0

Tags: #Vietnamese speech recognition #16kHz sampling rate #No language model dependency

Release Time : 3/2/2022

Model Overview

This model is an automatic speech recognition (ASR) system for Vietnamese. It is based on Facebook's Wav2Vec2 architecture, was fine-tuned on 100 hours of labelled data, and can be used directly for speech-to-text tasks.

Model Features

Vietnamese optimization

Specially trained and optimized for Vietnamese speech characteristics

No language model required

Can be used directly without additional language model support

Efficient processing

Expects speech input sampled at 16 kHz; described as suitable for real-time applications

Model Capabilities

Vietnamese speech recognition

Speech-to-text

Automatic speech recognition

Use Cases

Speech transcription

Convert Vietnamese speech content into text

WER of 31.35% on Common Voice test set

Smart assistants

Vietnamese voice command recognition

Used for human-computer interaction in Vietnamese smart voice assistants

🚀 Wav2Vec2-Base-Vietnamese

This model is fine-tuned for Vietnamese speech recognition from Vietnamese pre-trained weights, using labelled data from the VLSP dataset.

🚀 Quick Start

This model is fine-tuned from dragonSwing/wav2vec2-base-pretrain-vietnamese on the Vietnamese speech recognition task, using 100 hours of labelled data from the VLSP dataset. When using this model, ensure that your speech input is sampled at 16 kHz.
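Because the model expects 16 kHz input, it can help to check a recording's sample rate before transcribing it. The following is a minimal sketch using only Python's standard `wave` module; the helper names (`wav_sample_rate`, `needs_resampling`, `make_silent_wav`) are hypothetical and not part of the model's tooling, and the sketch only handles uncompressed WAV data.

```python
import io
import wave

TARGET_SAMPLE_RATE = 16_000  # rate the model expects, per the model card


def wav_sample_rate(source):
    """Return the sample rate (Hz) stored in a WAV header.

    `source` may be a file path or a binary file-like object.
    """
    with wave.open(source, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(source, target_rate=TARGET_SAMPLE_RATE):
    """Return True when the WAV data is not already at the target rate."""
    return wav_sample_rate(source) != target_rate


def make_silent_wav(sample_rate, n_frames=160):
    """Build a tiny mono 16-bit silent WAV in memory (illustration only)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * n_frames)
    buffer.seek(0)  # rewind so the data can be read back
    return buffer


if __name__ == "__main__":
    for rate in (48_000, 16_000):
        print(rate, "needs resampling:", needs_resampling(make_silent_wav(rate)))
```

If a file does need resampling, a transform such as `torchaudio.transforms.Resample`, as used in the usage examples, can convert it to 16 kHz.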

✨ Features

Datasets: Utilizes datasets such as vlsp and common_voice.
Metrics: Evaluated using the Word Error Rate (WER) metric.
Tags: audio, automatic-speech-recognition, and speech.
License: Released under the Apache 2.0 license.

Property	Details
Model Type	Wav2Vec2-Base-Vietnamese
Training Data	100 hours of labelled data from the VLSP dataset
Datasets	vlsp, common_voice
Metrics	wer
Tags	audio, automatic-speech-recognition, speech
License	Apache 2.0
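For readers unfamiliar with the WER metric, the sketch below shows one common way to compute it: the word-level edit distance (substitutions, insertions, deletions) summed over all utterances, divided by the total number of reference words. This is an illustrative pure-Python version under that assumption, not the implementation used by the `wer` metric in the evaluation script; the function names are hypothetical.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance between two token lists."""
    # previous_row[j] holds the distance between the reference prefix
    # processed so far and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1       # reference word missing
            insertion = current_row[j - 1] + 1   # extra hypothesis word
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1]


def word_error_rate(references, predictions):
    """Corpus-level WER: total word edits / total reference words."""
    total_edits = 0
    total_words = 0
    for reference, prediction in zip(references, predictions):
        ref_words = reference.split()
        hyp_words = prediction.split()
        total_edits += edit_distance(ref_words, hyp_words)
        total_words += len(ref_words)
    if total_words == 0:
        raise ValueError("references contain no words")
    return total_edits / total_words


if __name__ == "__main__":
    refs = ["xin chào việt nam"]
    hyps = ["xin chao việt"]  # one substitution, one deletion
    print("WER: {:.2f}%".format(100 * word_error_rate(refs, hyps)))
```

In the made-up example, two edits against four reference words give a WER of 50%.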

💻 Usage Examples

Basic Usage

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load the Vietnamese test split of Common Voice.
test_dataset = load_dataset("common_voice", "vi", split="test")

processor = Wav2Vec2Processor.from_pretrained("dragonSwing/wav2vec2-base-vietnamese")
model = Wav2Vec2ForCTC.from_pretrained("dragonSwing/wav2vec2-base-vietnamese")

# This assumes the source clips are sampled at 48 kHz; the model expects 16 kHz.
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the dataset.
# We need to read the audio files as arrays.
def speech_file_to_array_fn(batch):
  speech_array, sampling_rate = torchaudio.load(batch["path"])
  batch["speech"] = resampler(speech_array).squeeze().numpy()
  return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Run the first two clips through the model.
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)
with torch.no_grad():
  logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

# Pick the most likely token for each frame, then decode the ids to text.
predicted_ids = torch.argmax(logits, dim=-1)
print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
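The `torch.argmax` and `batch_decode` steps in the example above amount to greedy CTC decoding, which is why no separate language model is needed. As a rough sketch of the idea (not the library's actual implementation): consecutive repeated ids are collapsed, blank ids are dropped, and the word delimiter is turned into a space. The tiny vocabulary, the `blank_id=0` default, and the `"|"` delimiter below are assumptions chosen for illustration.

```python
def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0, word_delimiter="|"):
    """Turn per-frame argmax ids into text using the greedy CTC rule."""
    # Step 1: collapse runs of identical consecutive ids.
    collapsed = []
    previous = None
    for token_id in frame_ids:
        if token_id != previous:
            collapsed.append(token_id)
        previous = token_id
    # Step 2: drop blanks, which only separate repeated characters.
    chars = [id_to_token[i] for i in collapsed if i != blank_id]
    # Step 3: map the word delimiter to a space.
    return "".join(chars).replace(word_delimiter, " ").strip()


if __name__ == "__main__":
    # Hypothetical toy vocabulary; the real model ships its own vocabulary.
    vocab = {0: "<pad>", 1: "|", 2: "x", 3: "i", 4: "n"}
    frames = [2, 2, 0, 3, 3, 4, 0, 1, 1, 2, 0, 2]
    print(ctc_greedy_decode(frames, vocab))
```

Note how the blank between the last two `x` frames keeps them as two separate characters, while the doubled `x` at the start collapses into one.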

Advanced Usage

import re

import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load the Vietnamese test split of Common Voice and the WER metric.
test_dataset = load_dataset("common_voice", "vi", split="test")
wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("dragonSwing/wav2vec2-base-vietnamese")
model = Wav2Vec2ForCTC.from_pretrained("dragonSwing/wav2vec2-base-vietnamese")
model.to("cuda")  # this example assumes a CUDA-capable GPU

# Punctuation and stray symbols stripped from the reference transcripts.
chars_to_ignore_regex = r'[,?.!\-;:"“%\'�]'
# This assumes the source clips are sampled at 48 kHz; the model expects 16 kHz.
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the dataset.
# We need to read the audio files as arrays and normalize the reference text.
def speech_file_to_array_fn(batch):
  batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
  speech_array, sampling_rate = torchaudio.load(batch["path"])
  batch["speech"] = resampler(speech_array).squeeze().numpy()
  return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Run inference on each batch and store the decoded predictions.
def evaluate(batch):
  inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
  with torch.no_grad():
    logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits
  pred_ids = torch.argmax(logits, dim=-1)
  batch["pred_strings"] = processor.batch_decode(pred_ids)
  return batch

result = test_dataset.map(evaluate, batched=True, batch_size=1)
print("WER: {:2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))

Test Result (WER): 31.353591%
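The reported WER depends on how the reference transcripts are normalized before scoring. The sketch below isolates the normalization used in the evaluation script (the same regular expression, followed by lowercasing) so its effect can be seen on a sample sentence. The function name `normalize_transcript` and the sample sentences are made up for illustration.

```python
import re

# Same character class as chars_to_ignore_regex in the evaluation script.
CHARS_TO_IGNORE = r'[,?.!\-;:"“%\'�]'


def normalize_transcript(sentence):
    """Strip the ignored punctuation and lowercase the text."""
    return re.sub(CHARS_TO_IGNORE, "", sentence).lower()


if __name__ == "__main__":
    for sample in ["Xin chào, Việt Nam!", 'Anh ấy nói: "Được."']:
        print(repr(sample), "->", repr(normalize_transcript(sample)))
```

One detail worth noting: the character class includes the left curly quote `“` but not the right one `”`, so a closing curly quote in a transcript would survive this normalization.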

📄 License

This project is licensed under the Apache 2.0 license.
