wav2vec2-large-xlsr-marathi: an open-source model for automatic Marathi speech recognition

Wav2vec2 Large Xlsr Marathi

Developed by sumedh

A Marathi automatic speech recognition model fine-tuned from facebook/wav2vec2-large-xlsr-53 on the OpenSLR Marathi dataset, with a test-set word error rate of 12.7%.

Speech Recognition

Transformers

Open Source License: Apache-2.0

Tags: #Marathi speech recognition #Low word error rate (12.7%) #XLSR fine-tuning

Downloads 5,159

Release Time: 3/2/2022

Model Overview

This model is designed for Marathi automatic speech recognition (ASR). It is fine-tuned from Facebook's wav2vec2-large-xlsr-53 checkpoint and expects speech input sampled at 16 kHz.
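Because the model expects 16 kHz input, it can help to check a file's sample rate before transcription. The following is a minimal sketch using only Python's standard `wave` module. The helper names `wav_sample_rate` and `needs_resampling` and the constant `TARGET_RATE` are illustrative and not part of the model's API, and the approach only covers uncompressed PCM WAV files.

```python
import wave

TARGET_RATE = 16_000  # sample rate the model expects, per the description above


def wav_sample_rate(path):
    """Return the sample rate (Hz) stored in a PCM WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_rate=TARGET_RATE):
    """Return True if the file's sample rate differs from the target rate."""
    return wav_sample_rate(path) != target_rate
```

A file for which `needs_resampling` returns True would have to be resampled (for example with torchaudio, as in the usage examples) before being passed to the processor.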

Model Features

Low word error rate

Achieves a word error rate (WER) of 12.7% on the held-out test set (10% of the OpenSLR Marathi data).

Gender adaptability

Although trained only on female speech data, it performs well on male speech as well.

No language model required

Can be used directly without additional language model support.

Model Capabilities

Marathi speech recognition

16kHz sample rate audio processing

Use Cases

Speech transcription

Marathi speech to text

Convert Marathi speech content into text

12.7% word error rate

Voice assistants

Marathi voice command recognition

Used to develop voice assistants that understand Marathi

🚀 Wav2Vec2-Large-XLSR-53-Marathi

This model is fine-tuned from facebook/wav2vec2-large-xlsr-53 on Marathi using the OpenSLR-64 dataset. It can be used for automatic speech recognition tasks.

🚀 Quick Start

When using this model, ensure that your speech input is sampled at 16 kHz. Although the training data contains only female voices, the model also performs well on male voices. It was trained on Google Colab Pro with a Tesla P100 16GB GPU.

WER (Word Error Rate) on the Test Set: 12.70 %
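For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions, and insertions) between the prediction and the reference, divided by the number of reference words. The sketch below is a plain-Python illustration of that definition. The function name `word_error_rate` is made up here, and it scores a single sentence pair only; it is not the code that produced the 12.70 % figure, which comes from the `wer` metric used in the evaluation script.

```python
def word_error_rate(reference, prediction):
    """Word-level Levenshtein distance divided by the reference length."""
    ref_words = reference.split()
    pred_words = prediction.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j] = edits needed to turn the reference words processed
    # so far into the first j predicted words
    previous_row = list(range(len(pred_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, pred_word in enumerate(pred_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != pred_word)
            deletion = previous_row[j] + 1      # reference word missing from prediction
            insertion = current_row[j - 1] + 1  # extra word in prediction
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1] / len(ref_words)


# One substituted word out of four reference words
print(word_error_rate("a b c d", "a b x d"))  # 0.25
```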

✨ Features

Fine-tuned on the Marathi language using the OpenSLR-64 dataset.
Handles both female and male voices well, despite being trained only on female speech.
Trained on a powerful GPU (Tesla P100 16GB) on Google Colab Pro.

💻 Usage Examples

Basic Usage

The model can be used directly, without a language model, as follows. This assumes your dataset has an actual_text column (the Marathi transcript) and a path_in_folder column (the path to the audio file):

import torch, torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Marathi is not available in Common Voice, so build all_data with the dataset-reading code from the evaluation script below
mr_test_dataset = all_data['test']

processor = Wav2Vec2Processor.from_pretrained("sumedh/wav2vec2-large-xlsr-marathi") 
model = Wav2Vec2ForCTC.from_pretrained("sumedh/wav2vec2-large-xlsr-marathi") 

resampler = torchaudio.transforms.Resample(48_000, 16_000) # first arg: input sample rate (48 kHz source audio is assumed), second arg: output sample rate
# Preprocess the dataset: read the audio files as arrays
def speech_file_to_array_fn(batch):
  speech_array, sampling_rate = torchaudio.load(batch["path_in_folder"])
  batch["speech"] = resampler(speech_array).squeeze().numpy()
  return batch
mr_test_dataset = mr_test_dataset.map(speech_file_to_array_fn)
inputs = processor(mr_test_dataset["speech"][:5], sampling_rate=16_000, return_tensors="pt", padding=True)
with torch.no_grad():
  logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
predicted_ids = torch.argmax(logits, dim=-1)
print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", mr_test_dataset["actual_text"][:5])
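The `torch.argmax` plus `batch_decode` step above amounts to greedy CTC decoding: take the most likely token for each frame, merge consecutive repeats, then drop the blank token. The sketch below illustrates that idea on a toy vocabulary in plain Python. The vocabulary, the `<pad>` blank symbol, and the `|` word delimiter are assumptions chosen for illustration; they mirror common Wav2Vec2 conventions, but the real mapping lives in the model's tokenizer files.

```python
from itertools import groupby

# Toy vocabulary, assumed for illustration only
TOY_VOCAB = {0: "<pad>", 1: "|", 2: "म", 3: "न"}
BLANK_ID = 0          # token id playing the role of the CTC blank
WORD_DELIMITER = "|"  # token that marks a word boundary


def greedy_ctc_decode(frame_ids, vocab=TOY_VOCAB, blank_id=BLANK_ID):
    """Collapse repeated frame predictions, then remove blanks."""
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    tokens = [vocab[token_id] for token_id in collapsed if token_id != blank_id]
    return "".join(tokens).replace(WORD_DELIMITER, " ").strip()


# Frames: म म <pad> न न | म
print(greedy_ctc_decode([2, 2, 0, 3, 3, 1, 2]))  # मन म
```

Note that a blank between two identical tokens keeps them separate, which is how CTC can emit a doubled character.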

Advanced Usage

The model can be evaluated as follows on the test split, which is 10% of the Marathi data in OpenSLR-64.

import os, re, torch, torchaudio
from datasets import Dataset, load_metric
import pandas as pd
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Custom code for reading the Marathi dataset, since it is not available in Common Voice
dataset_path = "./OpenSLR-64_Marathi/mr_in_female/" # TODO: set this to the path of the dataset extracted from http://openslr.org/64/
audio_df = pd.read_csv(os.path.join(dataset_path,'line_index.tsv'),sep='\t',header=None)
audio_df.columns = ['path_in_folder','actual_text']
audio_df['path_in_folder'] = audio_df['path_in_folder'].apply(lambda x: dataset_path + x + '.wav')
audio_df = audio_df.sample(frac=1, random_state=2020).reset_index(drop=True) #seed number is important for reproducibility of WER score
all_data = Dataset.from_pandas(audio_df)
all_data = all_data.train_test_split(test_size=0.10,seed=2020) #seed number is important for reproducibility of WER score

mr_test_dataset = all_data['test']
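# Note: load_metric is deprecated in newer releases of datasets; evaluate.load("wer") from the evaluate package is its replacement
wer = load_metric("wer")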

processor = Wav2Vec2Processor.from_pretrained("sumedh/wav2vec2-large-xlsr-marathi")
model = Wav2Vec2ForCTC.from_pretrained("sumedh/wav2vec2-large-xlsr-marathi") 
model.to("cuda")

chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"\“]' # raw string avoids invalid-escape warnings
resampler = torchaudio.transforms.Resample(48_000, 16_000)
# Preprocess the dataset: strip punctuation from the reference text and read the audio files as arrays
def speech_file_to_array_fn(batch):
  batch["actual_text"] = re.sub(chars_to_ignore_regex, '', batch["actual_text"]).lower()
  speech_array, sampling_rate = torchaudio.load(batch["path_in_folder"])
  batch["speech"] = resampler(speech_array).squeeze().numpy()
  return batch
mr_test_dataset = mr_test_dataset.map(speech_file_to_array_fn)
def evaluate(batch):
  inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
  with torch.no_grad():
    logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
  return batch
result = mr_test_dataset.map(evaluate, batched=True, batch_size=8)
print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["actual_text"])))
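The evaluation script strips punctuation from the reference text before computing WER, so that punctuation marks are not counted as word errors. The snippet below isolates that step so it can be tried without the dataset. The pattern is the one from the script, written as a raw string; the function name `normalize_text` and the sample sentence are illustrative.

```python
import re

# Same character class as chars_to_ignore_regex in the evaluation script
CHARS_TO_IGNORE = re.compile(r'[\,\?\.\!\-\;\:\"\“]')


def normalize_text(text):
    """Remove the ignored punctuation and lowercase the result."""
    return CHARS_TO_IGNORE.sub("", text).lower()


print(normalize_text("नमस्कार, तुम्ही कसे आहात?"))
```

Lowercasing has no effect on Devanagari text; it only matters if a transcript contains Latin characters.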

📚 Documentation

Training

The train/test split was 90:10. The training notebook is available on Colab.
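The evaluation script fixes `seed=2020` so that the same 10% of utterances lands in the test set on every run. As a library-free illustration of why that matters, the sketch below performs a seeded 90:10 split with Python's `random` module. The function name `seeded_split` and the utterance IDs are hypothetical, and this is not the shuffling algorithm used by `Dataset.train_test_split`, so it will not reproduce the model card's exact split.

```python
import random


def seeded_split(items, test_fraction=0.10, seed=2020):
    """Shuffle a copy of items with a fixed seed and split off a test set."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    test_size = round(len(shuffled) * test_fraction)
    return shuffled[test_size:], shuffled[:test_size]


utterance_ids = [f"utt_{i:04d}" for i in range(100)]
train_ids, test_ids = seeded_split(utterance_ids)
print(len(train_ids), len(test_ids))  # 90 10
```

Calling `seeded_split` again with the same seed returns the same two lists, which is what makes a reported WER comparable across runs.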

Training Config and Summary

A Weights & Biases run summary is available for the training run.

📄 License

This project is licensed under the Apache 2.0 license.

Property	Details
Model Type	Fine-tuned Wav2Vec2-Large-XLSR-53 for Marathi
Training Data	OpenSLR-64 Marathi dataset
Tags	audio, automatic-speech-recognition, speech, xlsr-fine-tuning-week
Metrics	WER
Base Model	facebook/wav2vec2-large-xlsr-53
Model Name	XLSR Wav2Vec2 Large 53 Marathi by Sumedh Khodke
Test WER	12.7%
