The open-source model wav2vec2-large-xlsr-53-vietnamese - Vietnamese automatic speech recognition supporting 16kHz speech

Wav2vec2 Large Xlsr 53 Vietnamese

Developed by not-tanh

A Vietnamese automatic speech recognition model fine-tuned from facebook/wav2vec2-large-xlsr-53. It expects audio input sampled at 16kHz.

Speech Recognition

Transformers

Open Source License: Apache-2.0

#Vietnamese speech recognition #XLSR-53 fine-tuning #Multi-dataset training

Release Time : 3/2/2022

Model Overview

This model is an automatic speech recognition (ASR) model for Vietnamese. It is fine-tuned from the XLSR-53 checkpoint (wav2vec 2.0 architecture) on the Common Voice, VIVOS, and FOSD datasets.

Model Features

Multi-dataset fine-tuning

Fine-tuned using three Vietnamese datasets: Common Voice, VIVOS, and FOSD to improve recognition accuracy.

No language model required

Can be used directly without additional language model support.

16kHz sampling rate support

Expects audio input sampled at 16kHz; audio recorded at other rates should be resampled first.

Model Capabilities

Vietnamese speech recognition

Audio to text conversion

Speech transcription

Use Cases

Speech transcription

Vietnamese speech to text

Convert Vietnamese speech content into text

Word Error Rate: 39.57% (Common Voice vi test set)

Voice assistants

Vietnamese voice command recognition

Used for command recognition in Vietnamese voice assistant systems

🚀 Wav2Vec2-Large-XLSR-53-Vietnamese

This model is a fine-tuned version of facebook/wav2vec2-large-xlsr-53 for Vietnamese, trained on the Common Voice, VIVOS, and FOSD datasets. It provides a solution for Vietnamese speech recognition.

🚀 Quick Start

When using this model, ensure that your speech input is sampled at 16kHz.
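
Because the model expects 16kHz input, it can help to check a file's sampling rate before transcription. The following is a minimal sketch using only Python's standard `wave` module. The helper names `wav_sample_rate` and `needs_resampling` are illustrative assumptions, not part of the model's API, and the sketch only handles uncompressed PCM WAV files.

```python
import wave

EXPECTED_RATE = 16_000  # sampling rate the model expects, per the note above


def wav_sample_rate(path: str) -> int:
    """Return the sampling rate (Hz) stored in a PCM WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path: str, expected_rate: int = EXPECTED_RATE) -> bool:
    """Return True if the file must be resampled before it is passed to the model."""
    return wav_sample_rate(path) != expected_rate
```

If `needs_resampling` returns `True`, the audio can be resampled to 16kHz before it is handed to the processor, for example with `torchaudio.transforms.Resample` as in the usage examples.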

✨ Features

Fine-tuned on multiple Vietnamese datasets including Common Voice, VIVOS, and FOSD.
Suitable for Vietnamese speech recognition tasks.

💻 Usage Examples

Basic Usage

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "vi", split="test")

processor = Wav2Vec2Processor.from_pretrained("not-tanh/wav2vec2-large-xlsr-53-vietnamese")
model = Wav2Vec2ForCTC.from_pretrained("not-tanh/wav2vec2-large-xlsr-53-vietnamese")

# Assumes the source clips are sampled at 48kHz; the model expects 16kHz.
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to read the audio files as arrays.
def speech_file_to_array_fn(batch):
  speech_array, sampling_rate = torchaudio.load(batch["path"])
  batch["speech"] = resampler(speech_array).squeeze().numpy()
  return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
  logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
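
The `argmax` followed by `batch_decode` in the example above amounts to greedy CTC decoding: take the most likely token per frame, merge consecutive repeats, drop blanks, and turn word delimiters into spaces. The sketch below illustrates that rule in plain Python. The toy vocabulary, the blank ID of 0, and the function name `greedy_ctc_decode` are assumptions made for illustration; the real vocabulary ships with the model's processor.

```python
from itertools import groupby
from typing import Dict, List

# Toy vocabulary for illustration only (hypothetical, not the model's real vocabulary).
TOY_VOCAB: Dict[int, str] = {
    0: "<pad>", 1: "|", 2: "x", 3: "i", 4: "n", 5: "c", 6: "h", 7: "a", 8: "o",
}
BLANK_ID = 0          # assumed CTC blank (the pad token in Wav2Vec2-style tokenizers)
WORD_DELIMITER = "|"  # token that marks word boundaries


def greedy_ctc_decode(frame_ids: List[int], vocab: Dict[int, str] = TOY_VOCAB) -> str:
    """Collapse per-frame argmax IDs into text using the greedy CTC rule."""
    # 1. Merge runs of identical consecutive IDs into a single ID.
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # 2. Drop blanks; a blank between two equal IDs keeps a doubled character.
    tokens = [vocab[token_id] for token_id in collapsed if token_id != BLANK_ID]
    # 3. Replace word delimiters with spaces.
    return "".join(tokens).replace(WORD_DELIMITER, " ").strip()


frames = [2, 2, 0, 3, 4, 4, 1, 1, 5, 6, 0, 7, 8, 8]
print(greedy_ctc_decode(frames))  # prints "xin chao"
```

No language model is involved in this step, which is why the model can be used on its own.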

Advanced Usage

import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import re

test_dataset = load_dataset("common_voice", "vi", split="test")
wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("not-tanh/wav2vec2-large-xlsr-53-vietnamese")
model = Wav2Vec2ForCTC.from_pretrained("not-tanh/wav2vec2-large-xlsr-53-vietnamese")
model.to("cuda")

chars_to_ignore_regex = r'[,?.!\-;:"“%\'�]'
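# Assumes the source clips are sampled at 48kHz; the model expects 16kHz.
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We normalize the reference text and read the audio files as arrays.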
def speech_file_to_array_fn(batch):
  batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
  speech_array, sampling_rate = torchaudio.load(batch["path"])
  batch["speech"] = resampler(speech_array).squeeze().numpy()
  return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Run batched inference on the GPU and decode the predicted token IDs into text.
def evaluate(batch):
  inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

  with torch.no_grad():
    logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits

  pred_ids = torch.argmax(logits, dim=-1)
  batch["pred_strings"] = processor.batch_decode(pred_ids)
  return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

print("WER: {:2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))

Test Result: 39.571823%
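
To make the reported number easier to interpret, the following sketch shows how a corpus-level word error rate can be computed by hand: normalize the text, count word-level edits (substitutions, deletions, insertions) with Levenshtein distance, and divide by the number of reference words. It is an illustrative reimplementation under stated assumptions, not the code behind `load_metric("wer")`. The function names are hypothetical, and the punctuation pattern is a simplified version of the one in the evaluation script.

```python
import re
from typing import List

# Simplified version of the punctuation pattern used in the evaluation script.
CHARS_TO_IGNORE = r'[,?.!\-;:"“%\']'


def normalize(sentence: str) -> str:
    """Strip punctuation and lowercase, mirroring the reference preprocessing."""
    return re.sub(CHARS_TO_IGNORE, "", sentence).lower()


def word_edit_distance(reference: List[str], hypothesis: List[str]) -> int:
    """Levenshtein distance computed over words instead of characters."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def corpus_wer(references: List[str], predictions: List[str]) -> float:
    """Total word-level edits divided by total reference words."""
    errors = sum(
        word_edit_distance(ref.split(), pred.split())
        for ref, pred in zip(references, predictions)
    )
    total_words = sum(len(ref.split()) for ref in references)
    return errors / total_words
```

Multiplying the result of `corpus_wer` by 100 gives a percentage comparable in form to the test result above.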

📚 Documentation

Model Information

Property	Details
Model Type	Fine-tuned Wav2Vec2-Large-XLSR-53 for Vietnamese
Training Data	Common Voice, VIVOS, and FOSD datasets
Metrics	Word Error Rate (WER)
Tags	audio, automatic-speech-recognition, speech, xlsr-fine-tuning-week
License	apache-2.0

Model Index

Name: Ted Vietnamese XLSR Wav2Vec2 Large 53
- Results:
  - Task:
    - Name: Speech Recognition
    - Type: automatic-speech-recognition
  - Dataset:
    - Name: Common Voice vi
    - Type: common_voice
    - Args: vi
  - Metrics:
    - Name: Test WER
    - Type: wer
    - Value: 39.571823

🔧 Technical Details

The Common Voice train and validation splits, together with the VIVOS and FOSD datasets, were used for training. The script used for training can be found ... # TODO

📄 License

This model is licensed under the Apache-2.0 license.
