Wav2Vec2-Large-XLSR-53-Vietnamese
This model is a fine-tuned version of facebook/wav2vec2-large-xlsr-53 on Vietnamese, using the Common Voice and Infore_25h datasets (Password: BroughtToYouByInfoRe). When using this model, ensure that your speech input is sampled at 16 kHz.
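If your recordings are not already at 16 kHz, they can be resampled before inference. The snippet below is a minimal, illustrative sketch using torchaudio; the file name `my_audio.wav` is a placeholder, not part of this model card.

```python
import torchaudio

# Placeholder path: replace "my_audio.wav" with your own recording.
speech_array, sampling_rate = torchaudio.load("my_audio.wav")

# Resample to the 16 kHz rate expected by the model, if necessary.
if sampling_rate != 16_000:
    resampler = torchaudio.transforms.Resample(orig_freq=sampling_rate, new_freq=16_000)
    speech_array = resampler(speech_array)

speech = speech_array.squeeze().numpy()  # 1D array at 16 kHz, ready for the processor
```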
Model Information
| Property | Details |
|---|---|
| Model Type | Audio, Automatic Speech Recognition, Speech, XLSR-Fine-Tuning-Week |
| Training Datasets | Common Voice, Infore_25h |
| Evaluation Metric | WER (Word Error Rate) |
| License | Apache-2.0 |
Model Index
- Name: Cuong-Cong XLSR Wav2Vec2 Large 53
- Results:
  - Task:
    - Name: Speech Recognition
    - Type: automatic-speech-recognition
  - Dataset:
    - Name: Common Voice vi
    - Type: common_voice
    - Args: vi
  - Metrics:
    - Name: Test WER
    - Type: wer
    - Value: 58.63
Usage Examples
Basic Usage
```python
import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "vi", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("CuongLD/wav2vec2-large-xlsr-vietnamese")
model = Wav2Vec2ForCTC.from_pretrained("CuongLD/wav2vec2-large-xlsr-vietnamese")

resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
```
Evaluation
The model can be evaluated on the Vietnamese test data of Common Voice as follows:
```python
import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import re

test_dataset = load_dataset("common_voice", "vi", split="test")
wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("CuongLD/wav2vec2-large-xlsr-vietnamese")
model = Wav2Vec2ForCTC.from_pretrained("CuongLD/wav2vec2-large-xlsr-vietnamese")
model.to("cuda")

chars_to_ignore_regex = '[\,\?\.\!\-\;\:\"\“]'
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Run batched inference and collect the predicted transcriptions.
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)
print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
```
Test Result: 58.63 %
Training
The Common Voice train, validation, and Infore_25h datasets were used for training. The script used for training can be found here.
Your model is then available under huggingface.co/CuongLD/wav2vec2-large-xlsr-vietnamese for everyone to use.
How to Evaluate My Trained Checkpoint
After uploading your model, you should evaluate it in a final step. This can be as simple as copying the evaluation code from your model card into a Python script and running it. Make sure to note the final result on the model card both under the YAML tags at the very top and below your evaluation code under "Test Results".
Rules of Training and Evaluation
Training Data
All data except the official Common Voice test dataset can be used as training data. For models trained in a language not included in Common Voice, the model author is responsible for setting aside a reasonable amount of data for evaluation.
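For languages without an official Common Voice test set, one way to set aside evaluation data is the `train_test_split` helper from the `datasets` library. This is only a sketch; `my_transcripts.csv` is a hypothetical file with your own audio paths and transcriptions.

```python
from datasets import load_dataset

# Hypothetical custom corpus with "path" and "sentence" columns.
my_dataset = load_dataset("csv", data_files="my_transcripts.csv", split="train")

# Reserve ~10% of the data for evaluation and keep it untouched during training.
splits = my_dataset.train_test_split(test_size=0.1, seed=42)
train_dataset = splits["train"]
eval_dataset = splits["test"]
```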
Data Preprocessing
It is allowed (and recommended) to normalize the data to only have lower-case characters and to remove typographical symbols and punctuation marks. However, we should not remove symbols that change the meaning of words. For example, in English, we should not remove the single quotation mark '. When in doubt, feel free to ask on Slack or post on the forum, like here.
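As an illustration of this rule (not the exact preprocessing used for this model), the sketch below lower-cases the text and strips punctuation while keeping the apostrophe:

```python
import re

# Punctuation that can safely be removed; the apostrophe is deliberately not included.
chars_to_remove_regex = r'[,?.!;:"-]'

def normalize(sentence):
    return re.sub(chars_to_remove_regex, '', sentence).lower()

print(normalize("Don't remove the apostrophe!"))  # -> "don't remove the apostrophe"
```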
Tips and Tricks
Combine Multiple Datasets
Check out this post.
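As a minimal sketch (the linked post covers the recommended approach in more detail), datasets with matching columns can be merged with `concatenate_datasets`; here the Common Voice train and validation splits are combined into one training set:

```python
from datasets import concatenate_datasets, load_dataset

# Merge the Common Voice Vietnamese train and validation splits into one training set.
cv_train = load_dataset("common_voice", "vi", split="train")
cv_valid = load_dataset("common_voice", "vi", split="validation")

combined = concatenate_datasets([cv_train, cv_valid])  # features must match across datasets
print(len(combined))
```

A second corpus such as Infore_25h can be appended the same way, provided it exposes the same columns (e.g. "path" and "sentence") as the Common Voice splits.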
Load Datasets with Limited Resources
Check out this post.
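This is not necessarily the trick described in the linked post, but one option the `datasets` library offers when disk space or RAM is tight is streaming, which avoids downloading and caching the full dataset:

```python
from datasets import load_dataset

# Stream examples instead of materialising the whole dataset on disk.
streamed = load_dataset("common_voice", "vi", split="train", streaming=True)

for i, example in enumerate(streamed):
    print(example["sentence"])
    if i == 2:  # peek at a few examples only
        break
```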
Further Reading Material
It is recommended to learn about how Wav2Vec2 works in theory. Understanding the theory and inner mechanisms of the model can help with fine-tuning. However, it is not necessary to go through the theory to fine-tune Wav2Vec2 on your chosen language.
Here are some resources to better understand Wav2Vec2:
- Facebook's Wav2Vec2 blog post
- Official Wav2Vec2 paper
- Official XLSR Wav2vec2 paper
- Hugging Face Blog
- How does CTC (Connectionist Temporal Classification) work
Key Points to Understand
- Pretraining: XLSR-Wav2Vec2 was pretrained by masking feature vectors and having the model predict them, similar to BERT's masked language model.
- Model Parts: The feature extractor extracts feature vectors from the 1D raw audio waveform, and the transformer maps feature vectors to contextualized feature vectors.
- Fine-Tuning: The language head needs to be fine-tuned, and the authors recommend not further fine-tuning the feature extractor (see the sketch after this list).
- Training Data: The checkpoint was pretrained on 53 languages.
- Similar Languages: The official XLSR Wav2Vec2 paper shows which languages share a common contextualized latent space.
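As mentioned in the fine-tuning point above, the feature extractor is usually kept frozen. A minimal sketch using the Transformers helper for this:

```python
from transformers import Wav2Vec2ForCTC

# Load the pretrained XLSR-53 checkpoint for CTC fine-tuning; the CTC head is newly initialised.
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-xlsr-53")

# Keep the convolutional feature extractor frozen, as recommended by the authors.
model.freeze_feature_extractor()
```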
FAQ
- Can a participant fine-tune models for more than one language?
  - Yes! A participant can fine-tune models in as many languages as they like.
- Can a participant use extra data (apart from the Common Voice data)?
  - Yes! All data except the official Common Voice test data can be used for training. If training on a language not in Common Voice, some test data should be held out to prevent overfitting.
- Can we fine-tune for high-resource languages?
  - Yes! While we don't recommend fine-tuning models in English due to the large number of existing models, it is appreciated if participants fine-tune models in other "high-resource" languages like French, Spanish, or German. For such cases, local training and tricks like lazy data loading might be needed.

