Wav2Vec2-Large-XLSR-53-Vietnamese
This model is a fine-tuned version of facebook/wav2vec2-large-xlsr-53 on Vietnamese, using the Common Voice and Infore_25h datasets (Password: BroughtToYouByInfoRe). When using this model, ensure that your speech input is sampled at 16 kHz.
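If your recordings are not already at 16 kHz, they can be resampled before inference. The snippet below is a minimal, illustrative sketch using torchaudio; the file name `my_audio.wav` is a placeholder, not part of this model card.

```python
import torchaudio

# Placeholder path: replace "my_audio.wav" with your own recording.
speech_array, sampling_rate = torchaudio.load("my_audio.wav")

# Resample to the 16 kHz rate expected by the model, if necessary.
if sampling_rate != 16_000:
    resampler = torchaudio.transforms.Resample(orig_freq=sampling_rate, new_freq=16_000)
    speech_array = resampler(speech_array)

speech = speech_array.squeeze().numpy()  # 1D array at 16 kHz, ready for the processor
```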
Model Information
| Property | Details |
|---|---|
| Model Type | Audio, Automatic Speech Recognition, Speech, XLSR-Fine-Tuning-Week |
| Training Datasets | Common Voice, Infore_25h |
| Evaluation Metric | WER (Word Error Rate) |
| License | Apache-2.0 |
Model Index
- Name: Cuong-Cong XLSR Wav2Vec2 Large 53
- Results:
  - Task:
    - Name: Speech Recognition
    - Type: automatic-speech-recognition
  - Dataset:
    - Name: Common Voice vi
    - Type: common_voice
    - Args: vi
  - Metrics:
    - Name: Test WER
    - Type: wer
    - Value: 58.63
Usage Examples
Basic Usage
```python
import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "vi", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("CuongLD/wav2vec2-large-xlsr-vietnamese")
model = Wav2Vec2ForCTC.from_pretrained("CuongLD/wav2vec2-large-xlsr-vietnamese")

resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
```
Evaluation
The model can be evaluated on the Vietnamese test data of Common Voice as follows:
```python
import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import re

test_dataset = load_dataset("common_voice", "vi", split="test")
wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("CuongLD/wav2vec2-large-xlsr-vietnamese")
model = Wav2Vec2ForCTC.from_pretrained("CuongLD/wav2vec2-large-xlsr-vietnamese")
model.to("cuda")

chars_to_ignore_regex = '[\,\?\.\!\-\;\:\"\“]'
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Run batched inference and collect the predicted transcriptions.
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)
print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
```
Test Result: 58.63 %
Training
The Common Voice train, validation, and Infore_25h datasets were used for training. The script used for training can be found here.
Your model is then available under huggingface.co/CuongLD/wav2vec2-large-xlsr-vietnamese for everyone to use.
How to Evaluate My Trained Checkpoint
After uploading your model, you should evaluate it in a final step. This can be as simple as copying the evaluation code from your model card into a Python script and running it. Make sure to note the final result on the model card both under the YAML tags at the very top and below your evaluation code under "Test Results".
Rules of Training and Evaluation
Training Data
All data except the official Common Voice test dataset can be used as training data. For models trained in a language not included in Common Voice, the model author is responsible for setting aside a reasonable amount of data for evaluation.
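For languages without an official Common Voice test set, one way to set aside evaluation data is the `train_test_split` helper from the `datasets` library. This is only a sketch; `my_transcripts.csv` is a hypothetical file with your own audio paths and transcriptions.

```python
from datasets import load_dataset

# Hypothetical custom corpus with "path" and "sentence" columns.
my_dataset = load_dataset("csv", data_files="my_transcripts.csv", split="train")

# Reserve ~10% of the data for evaluation and keep it untouched during training.
splits = my_dataset.train_test_split(test_size=0.1, seed=42)
train_dataset = splits["train"]
eval_dataset = splits["test"]
```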
Data Preprocessing
It is allowed (and recommended) to normalize the data to only have lower-case characters and to remove typographical symbols and punctuation marks. However, we should not remove symbols that change the meaning of words. For example, in English, we should not remove the single quotation mark '. When in doubt, feel free to ask on Slack or post on the forum, like here.
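As an illustration of this rule (not the exact preprocessing used for this model), the sketch below lower-cases the text and strips punctuation while keeping the apostrophe:

```python
import re

# Punctuation that can safely be removed; the apostrophe is deliberately not included.
chars_to_remove_regex = r'[,?.!;:"-]'

def normalize(sentence):
    return re.sub(chars_to_remove_regex, '', sentence).lower()

print(normalize("Don't remove the apostrophe!"))  # -> "don't remove the apostrophe"
```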
Tips and Tricks
Combine Multiple Datasets
Check out this post.
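As a minimal sketch (the linked post covers the recommended approach in more detail), datasets with matching columns can be merged with `concatenate_datasets`; here the Common Voice train and validation splits are combined into one training set:

```python
from datasets import concatenate_datasets, load_dataset

# Merge the Common Voice Vietnamese train and validation splits into one training set.
cv_train = load_dataset("common_voice", "vi", split="train")
cv_valid = load_dataset("common_voice", "vi", split="validation")

combined = concatenate_datasets([cv_train, cv_valid])  # features must match across datasets
print(len(combined))
```

A second corpus such as Infore_25h can be appended the same way, provided it exposes the same columns (e.g. "path" and "sentence") as the Common Voice splits.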
Load Datasets with Limited Resources
Check out this post.
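This is not necessarily the trick described in the linked post, but one option the `datasets` library offers when disk space or RAM is tight is streaming, which avoids downloading and caching the full dataset:

```python
from datasets import load_dataset

# Stream examples instead of materialising the whole dataset on disk.
streamed = load_dataset("common_voice", "vi", split="train", streaming=True)

for i, example in enumerate(streamed):
    print(example["sentence"])
    if i == 2:  # peek at a few examples only
        break
```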
Further Reading Material
It is recommended to learn about how Wav2Vec2 works in theory. Understanding the theory and inner mechanisms of the model can help with fine-tuning. However, it is not necessary to go through the theory to fine-tune Wav2Vec2 on your chosen language.
Here are some resources to better understand Wav2Vec2:
- Facebook's Wav2Vec2 blog post
- Official Wav2Vec2 paper
- Official XLSR Wav2vec2 paper
- Hugging Face Blog
- How does CTC (Connectionist Temporal Classification) work
Key Points to Understand
- Pretraining: XLSR-Wav2Vec2 was pretrained by masking feature vectors and having the model predict them, similar to BERT's masked language model.
- Model Parts: The feature extractor extracts feature vectors from the 1D raw audio waveform, and the transformer maps feature vectors to contextualized feature vectors.
- Fine-Tuning: The language head needs to be fine-tuned, and the authors recommend not further fine-tuning the feature extractor (see the sketch after this list).
- Training Data: The checkpoint was pretrained on 53 languages.
- Similar Languages: The official XLSR Wav2Vec2 paper shows which languages share a common contextualized latent space.
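As mentioned in the fine-tuning point above, the feature extractor is usually kept frozen. A minimal sketch using the Transformers helper for this:

```python
from transformers import Wav2Vec2ForCTC

# Load the pretrained XLSR-53 checkpoint for CTC fine-tuning; the CTC head is newly initialised.
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-xlsr-53")

# Keep the convolutional feature extractor frozen, as recommended by the authors.
model.freeze_feature_extractor()
```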
FAQ
- Can a participant fine-tune models for more than one language?
  - Yes! A participant can fine-tune models in as many languages as they like.
- Can a participant use extra data (apart from the Common Voice data)?
  - Yes! All data except the official Common Voice test data can be used for training. If training on a language not in Common Voice, some test data should be held out to prevent overfitting.
- Can we fine-tune for high-resource languages?
  - Yes! While we don't recommend fine-tuning models in English due to the large number of existing models, it is appreciated if participants fine-tune models in other "high-resource" languages like French, Spanish, or German. For such cases, local training and tricks like lazy data loading might be needed.

