# Wav2Vec2-Large-XLSR-53-Marathi
This model is fine-tuned from facebook/wav2vec2-large-xlsr-53 on Marathi using a part of the InterSpeech 2021 Marathi dataset. It is designed for automatic speech recognition of Marathi; speech input must be sampled at 16kHz.
## Model Information

| Property | Details |
|----------|---------|
| Model Type | XLSR Wav2Vec2 Large 53 Marathi 2 by Gunjan Chhablani |
| Training Datasets | interspeech_2021_asr |
| Evaluation Metrics | wer |
| Tags | audio, automatic-speech-recognition, speech, xlsr-fine-tuning-week |
| License | apache-2.0 |
| Results | Task: Speech Recognition (automatic-speech-recognition); Dataset: InterSpeech 2021 ASR mr (interspeech_2021_asr); Metrics: Test WER = 14.53 |
## Quick Start

This fine-tuned model is based on facebook/wav2vec2-large-xlsr-53 and trained on a part of the InterSpeech 2021 Marathi dataset. When using this model, ensure that your speech input is sampled at 16kHz.
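If your audio is not already at 16kHz, you can resample it before passing it to the processor. A minimal sketch with torchaudio, assuming a hypothetical local file `sample.wav` at an arbitrary sampling rate:

```python
import torchaudio

# "sample.wav" is a placeholder; substitute your own Marathi audio file
speech_array, sampling_rate = torchaudio.load("sample.wav")
if sampling_rate != 16_000:
    # Resample from the file's native rate to the 16kHz the model expects
    resampler = torchaudio.transforms.Resample(sampling_rate, 16_000)
    speech_array = resampler(speech_array)
```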
## Usage Examples

### Basic Usage
```python
import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# test_dataset = ...  # TODO: load a Marathi test dataset with "path" and "sentence" columns

processor = Wav2Vec2Processor.from_pretrained("gchhablani/wav2vec2-large-xlsr-mr-2")
model = Wav2Vec2ForCTC.from_pretrained("gchhablani/wav2vec2-large-xlsr-mr-2")

# Resample the 8kHz source audio to the 16kHz the model expects
resampler = torchaudio.transforms.Resample(8_000, 16_000)

# Preprocessing: read each audio file as a float array resampled to 16kHz
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)
print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
```
## Documentation

### Evaluation

The model can be evaluated on the test set of the Marathi data from InterSpeech 2021 as follows:
```python
import re

import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# test_dataset = ...  # TODO: load the Marathi test split with "path" and "sentence" columns

wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("gchhablani/wav2vec2-large-xlsr-mr-2")
model = Wav2Vec2ForCTC.from_pretrained("gchhablani/wav2vec2-large-xlsr-mr-2")
model.to("cuda")

# Punctuation to strip from the reference transcripts before scoring
chars_to_ignore_regex = '[\,\?\.\!\-\;\:\"\“\'\�]'
resampler = torchaudio.transforms.Resample(8_000, 16_000)

# Preprocessing: normalize the reference text and resample the audio to 16kHz
def speech_file_to_array_fn(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Batched inference: greedy (argmax) CTC decoding of the logits
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"),
                       attention_mask=inputs.attention_mask.to("cuda")).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)
print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
```
Test Result: 19.98% (555 examples from the test set were used for evaluation)

Test Result on 10% of OpenSLR74 data: 64.64%
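For context, WER counts the word-level substitutions, deletions, and insertions needed to turn a prediction into its reference, divided by the number of reference words. A quick sanity check of the metric on made-up strings (not actual model output):

```python
from datasets import load_metric

wer = load_metric("wer")
# One substituted word in a four-word reference -> WER = 0.25
print(wer.compute(predictions=["he is a boy"], references=["she is a boy"]))
```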
### Training

5000 examples from the InterSpeech Marathi dataset were used for training. The Colab notebook used for training can be found here.
## License

This model is licensed under the apache-2.0 license.