wav2vec2-large-xlsr-cnh: an open-source Hakha Chin speech recognition model

Wav2vec2 Large Xlsr Cnh

Developed by gchhablani

A Hakha Chin speech recognition model fine-tuned from the facebook/wav2vec2-large-xlsr-53 model, trained on the Common Voice dataset with a test WER of 31.38%.

Speech Recognition | Other | Open Source License: Apache-2.0
#Hakha Chin speech recognition #Low-resource language ASR #XLSR fine-tuning

Downloads 22

Release Time : 3/2/2022

Model Overview

This is an automatic speech recognition (ASR) model for Hakha Chin, fine-tuned from the Wav2Vec2 Large XLSR-53 checkpoint. It converts Hakha Chin speech into text.

Model Features

Based on XLSR-53 Architecture

Uses facebook/wav2vec2-large-xlsr-53 as the base model, a checkpoint pretrained for cross-lingual speech representation learning on audio from 53 languages.

Low-resource Language Support

Specifically optimized for Hakha Chin, a less-resourced language, helping to preserve linguistic diversity.

No Language Model Required

Can be used directly without additional language models, simplifying deployment.

Model Capabilities

Speech recognition

Hakha Chin speech-to-text

16kHz audio processing

Use Cases

Speech Technology

Hakha Chin Speech Transcription

Automatically convert Hakha Chin speech content into text

Word Error Rate (WER) 31.38%

Voice Assistant Development

Develop voice interaction applications for Hakha Chin users

Language Preservation

Minority Language Digitization

Help preserve and digitize minority languages like Hakha Chin

🚀 Wav2Vec2-Large-XLSR-53-Hakha-Chin

This model is fine-tuned from facebook/wav2vec2-large-xlsr-53 on Hakha Chin using the Common Voice dataset. It can be used for automatic speech recognition tasks.

🚀 Quick Start

When using this model, make sure that your speech input is sampled at 16kHz.
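
One quick way to check a recording before transcription is to read the sampling rate from its header. The sketch below uses only Python's standard `wave` module, so it assumes uncompressed WAV input; compressed formats would need decoding first. The helper name `needs_resampling` is hypothetical and not part of the model's tooling.

```python
import wave

TARGET_RATE = 16_000  # sampling rate the model expects, per the note above


def needs_resampling(path, target_rate=TARGET_RATE):
    """Return (needs_resampling, source_rate) for an uncompressed WAV file.

    Hypothetical helper for illustration only.
    """
    with wave.open(path, "rb") as wav_file:
        source_rate = wav_file.getframerate()
    return source_rate != target_rate, source_rate
```

If the first value is `True`, the audio should be resampled to 16 kHz before it reaches the processor, for example with `torchaudio.transforms.Resample(source_rate, 16_000)` as in the usage examples.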

✨ Features

Fine-tuned on the Hakha Chin language using the Common Voice dataset.
Can be used directly for speech recognition without a language model.

💻 Usage Examples

Basic Usage

The model can be used directly (without a language model) as follows:

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "cnh", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("gchhablani/wav2vec2-large-xlsr-cnh")
model = Wav2Vec2ForCTC.from_pretrained("gchhablani/wav2vec2-large-xlsr-cnh")

resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocess the dataset.
# Read each audio file into an array and resample it from 48 kHz to 16 kHz.
def speech_file_to_array_fn(batch):
  speech_array, sampling_rate = torchaudio.load(batch["path"])
  batch["speech"] = resampler(speech_array).squeeze().numpy()
  return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
  logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
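
The `torch.argmax` and `batch_decode` steps above amount to greedy CTC decoding: take the most likely token per audio frame, merge consecutive repeats, and drop the blank token. The sketch below shows that collapse rule in plain Python. The toy vocabulary, the blank index `0`, and the function name `greedy_ctc_decode` are illustrative assumptions, not the model's actual tokenizer, which also handles details such as word delimiters.

```python
def greedy_ctc_decode(frame_ids, id_to_char, blank_id=0):
    """Merge consecutive repeated ids, then drop blanks (greedy CTC rule)."""
    chars = []
    previous = None
    for token_id in frame_ids:
        # Emit a character only when the id changes and is not the blank.
        if token_id != previous and token_id != blank_id:
            chars.append(id_to_char[token_id])
        previous = token_id
    return "".join(chars)


# Toy vocabulary (hypothetical); index 0 plays the role of the CTC blank.
toy_vocab = {0: "<blank>", 1: "a", 2: "k", 3: "l"}

# Per-frame argmax ids, read as: k k <blank> a a <blank> l <blank> l
frame_ids = [2, 2, 0, 1, 1, 0, 3, 0, 3]
print(greedy_ctc_decode(frame_ids, toy_vocab))  # prints "kall"
```

The blank between the two `l` frames is what lets a doubled letter survive; without it, the repeated ids would be merged into one.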

Advanced Usage

The model can be evaluated as follows on the Hakha Chin test data of Common Voice.

import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import re

test_dataset = load_dataset("common_voice", "cnh", split="test")
wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("gchhablani/wav2vec2-large-xlsr-cnh")
model = Wav2Vec2ForCTC.from_pretrained("gchhablani/wav2vec2-large-xlsr-cnh")
model.to("cuda")

# Raw string so the backslashes reach the regex engine unchanged.
chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"\“\%\‘\”\�\/]'
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocess the dataset.
# Strip punctuation from the reference text, lowercase it, then read and resample the audio.
def speech_file_to_array_fn(batch):
  batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
  speech_array, sampling_rate = torchaudio.load(batch["path"])
  batch["speech"] = resampler(speech_array).squeeze().numpy()
  return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Run batched inference on the GPU and greedily decode the predictions.
def evaluate(batch):
  inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

  with torch.no_grad():
    logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits
    pred_ids = torch.argmax(logits, dim=-1)

  batch["pred_strings"] = processor.batch_decode(pred_ids)
  return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))

Test Result: 31.38 %
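
For reference, WER is the word-level edit distance (substitutions, deletions, and insertions) between prediction and reference, divided by the number of reference words. The sketch below normalizes a reference in the same spirit as the script above and computes WER in plain Python. It is an illustration, not the implementation behind the `wer` metric; the punctuation set is a simplified assumption, the function names are hypothetical, and the sentences are English placeholders rather than Hakha Chin data.

```python
import re

# Simplified punctuation set (assumption); the evaluation script uses a longer one.
PUNCTUATION = re.compile(r'[,?.!\-;:"%/]')


def normalize(text):
    """Remove punctuation and lowercase, mirroring the reference cleanup step."""
    return PUNCTUATION.sub("", text).lower()


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    # previous_row[j]: edit distance between the reference words seen so far
    # and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1] / len(ref_words)


reference = normalize("One, two three four!")  # "one two three four"
hypothesis = "one too three"
print("WER: {:.2f}".format(100 * word_error_rate(reference, hypothesis)))  # WER: 50.00
```

Here one substitution ("two" to "too") and one deletion ("four") against four reference words give 2 / 4 = 50%.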

📚 Documentation

The Common Voice train and validation datasets were used for training. The script used for training can be found here.

📄 License

This model is licensed under the Apache-2.0 license.

📊 Model Information

Property	Details
Model Type	Wav2Vec2-Large-XLSR-53-Hakha-Chin
Training Data	Common Voice `train` and `validation` datasets
Metrics	WER (Word Error Rate)
Task	Automatic Speech Recognition
Dataset	Common Voice cnh
Test WER	31.38%
