wav2vec2-large-xlsr-53-odia Open-source Model - Free Deployment to Achieve Automatic Speech Recognition for Odia

Wav2vec2 Large Xlsr 53 Odia

Developed by theainerd

Fine-tuned Odia automatic speech recognition model based on facebook/wav2vec2-large-xlsr-53, trained using Low Resource Indian Language Challenge data

Speech Recognition

Transformers

Open Source License: Apache-2.0

#Odia speech recognition #Low-resource language ASR #XLSR fine-tuned model

Downloads 83

Release Time: 3/2/2022

Model Overview

This model is built for Odia speech recognition. It is fine-tuned from Facebook's wav2vec2-large-xlsr-53 checkpoint and expects speech input sampled at 16kHz.

Model Features

Low-resource language optimization

Specifically optimized for low-resource languages like Odia, using dedicated datasets for fine-tuning

No language model required

Can be used directly without additional language model support

16kHz sampling rate support

Expects speech input sampled at 16kHz; audio recorded at other rates should be resampled first

Model Capabilities

Odia speech recognition

Speech-to-text

Automatic speech recognition

Use Cases

Speech transcription

Odia speech transcription

Convert Odia speech content into text

Word error rate 68.75%

Voice-assisted applications

Voice control interface

Develop voice control applications for Odia-speaking users

🚀 Wav2Vec2-Large-XLSR-53-Odia

This model is a fine-tuned version of facebook/wav2vec2-large-xlsr-53 for Odia, trained on data from the Multilingual and code-switching ASR challenges for low resource Indian languages. Make sure your speech input is sampled at 16kHz when using this model.

🚀 Quick Start

Load the processor and model from theainerd/wav2vec2-large-xlsr-53-odia with the transformers library and pass in speech sampled at 16kHz. The Usage Examples section gives complete scripts for transcription and evaluation.

✨ Features

Language Focus: Specifically fine-tuned for the Odia language.
Data Source: Uses data from the Multilingual and code-switching ASR challenges for low resource Indian languages.
Sampling Requirement: Requires speech input sampled at 16kHz.
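
Because the model expects 16kHz input, it can help to check a recording's sample rate before transcribing it. The following is a minimal sketch using only Python's standard `wave` module. It assumes uncompressed PCM WAV files (compressed formats such as MP3 need a different reader), and the helper names `wav_sample_rate` and `needs_resampling` are hypothetical, not part of the model or of any library.

```python
import wave

# The sample rate the model card asks for.
EXPECTED_RATE = 16_000


def wav_sample_rate(path):
    """Return the sample rate (in Hz) stored in a PCM WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, expected_rate=EXPECTED_RATE):
    """Return True when the file's sample rate differs from the expected one."""
    return wav_sample_rate(path) != expected_rate
```

If `needs_resampling` returns True, the audio should be resampled to 16kHz before it reaches the processor, for example with `torchaudio.transforms.Resample` as in the usage examples.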

📦 Installation

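The original README gives no installation steps. The usage examples below import torch, torchaudio, datasets, and transformers, so those packages need to be installed first.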

💻 Usage Examples

Basic Usage

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "or", split="test[:2%]")
processor = Wav2Vec2Processor.from_pretrained("theainerd/wav2vec2-large-xlsr-53-odia")
model = Wav2Vec2ForCTC.from_pretrained("theainerd/wav2vec2-large-xlsr-53-odia")
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocess the dataset.
# Read each audio file into an array and resample it from 48kHz to 16kHz.
def speech_file_to_array_fn(batch):
  speech_array, sampling_rate = torchaudio.load(batch["path"])
  batch["speech"] = resampler(speech_array).squeeze().numpy()
  return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
  logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
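
The example above decodes without a language model: it takes the most likely token at every audio frame (`torch.argmax`) and lets `processor.batch_decode` turn that frame-level sequence into text. For a CTC model this amounts to collapsing consecutive repeats and dropping the blank token. The toy sketch below illustrates that idea in plain Python. The vocabulary, the `blank_id` value, and the function name `ctc_greedy_decode` are illustrative assumptions, not the model's real vocabulary or the library's implementation.

```python
def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0):
    """Collapse repeated frame predictions and remove blanks (greedy CTC)."""
    decoded = []
    previous = None
    for token_id in frame_ids:
        # Emit a token only when it differs from the previous frame
        # and is not the blank symbol.
        if token_id != previous and token_id != blank_id:
            decoded.append(id_to_token[token_id])
        previous = token_id
    return "".join(decoded)


# Hypothetical toy vocabulary: id 0 plays the role of the CTC blank.
toy_vocab = {0: "<blank>", 1: "a", 2: "b"}

# Per-frame argmax ids. The blank between the two runs of 1
# is what allows the letter "a" to appear twice in a row.
frame_ids = [0, 1, 1, 0, 1, 2, 2, 0]
print(ctc_greedy_decode(frame_ids, toy_vocab))
```

Greedy decoding is simple and fast, but it cannot use word-level context to correct errors, which is one reason models used this way can show a high word error rate.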

Advanced Usage

import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import re

test_dataset = load_dataset("common_voice", "or", split="test")
wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("theainerd/wav2vec2-large-xlsr-53-odia")
model = Wav2Vec2ForCTC.from_pretrained("theainerd/wav2vec2-large-xlsr-53-odia")
model.to("cuda")

resampler = torchaudio.transforms.Resample(48_000, 16_000)

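# Punctuation stripped from the reference transcripts before scoring.
# NOTE: chars_to_ignore_regex is not defined in the original snippet; this
# pattern is an assumption and should match what was removed during training.
chars_to_ignore_regex = r'[,?.!\-;:"“]'

# Preprocess the dataset.
# Normalize each transcript, then read the audio file into an array
# and resample it from 48kHz to 16kHz.
def speech_file_to_array_fn(batch):
  batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
  speech_array, sampling_rate = torchaudio.load(batch["path"])
  batch["speech"] = resampler(speech_array).squeeze().numpy()
  return batch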

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Run the model on a batch of audio and decode the predictions.
def evaluate(batch):
  inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

  with torch.no_grad():
    logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits

  # Greedy decoding: take the most likely token at each frame.
  pred_ids = torch.argmax(logits, dim=-1)
  batch["pred_strings"] = processor.batch_decode(pred_ids)
  return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))

📚 Documentation

Test Result

The model achieved a Word Error Rate (WER) of 68.75% on the test data.
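
To help interpret this figure: WER is the number of word-level substitutions, deletions, and insertions needed to turn the prediction into the reference, divided by the number of reference words. A WER of 68.75% therefore means roughly 69 word edits per 100 reference words. The sketch below shows the calculation in plain Python. It is an illustration of the metric's definition, not the code used to produce the reported number, and the function names are hypothetical.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance computed with dynamic programming."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of an extra word
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def word_error_rate(references, predictions):
    """Corpus-level WER: total word edits divided by total reference words."""
    total_edits = 0
    total_words = 0
    for reference, prediction in zip(references, predictions):
        ref_words = reference.split()
        total_edits += edit_distance(ref_words, prediction.split())
        total_words += len(ref_words)
    if total_words == 0:
        raise ValueError("references contain no words")
    return total_edits / total_words


# One substitution (b -> x) and one deletion (d) over four reference words.
print(100 * word_error_rate(["a b c d"], ["a x c"]))
```

Because insertions count as errors, WER can exceed 100% when the prediction contains many more words than the reference.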

Training

The script used for training can be found here: Odia ASR Fine Tuning Wav2Vec2.

📄 License

This model is licensed under the Apache 2.0 license.

📦 Model Information

Property	Details
Model Type	Wav2Vec2-Large-XLSR-53-Odia
Training Data	OpenSLR
Evaluation Metric	Word Error Rate (WER)
Test WER	68.75%
License	Apache 2.0
