Wav2Vec2-Large-XLSR-53-Marathi
This model is facebook/wav2vec2-large-xlsr-53 fine-tuned on Marathi using the Open SLR64 dataset. It can be used for automatic speech recognition.
Quick Start
When using this model, ensure that your speech input is sampled at 16 kHz. Although the training data contains only female voices, the model is reported to perform well on male voices too. It was trained on Google Colab Pro with a Tesla P100 16GB GPU.
WER (Word Error Rate) on the test set: 12.70%
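The 16 kHz requirement above can be checked before any audio reaches the model. The following is a minimal sketch, assuming the input is an uncompressed PCM WAV file; the helper names `wav_sample_rate` and `needs_resampling` are hypothetical and are not part of this model or of any library.

```python
import wave

TARGET_RATE = 16_000  # sampling rate the model expects, per the note above


def wav_sample_rate(path):
    """Return the sampling rate (Hz) stored in a PCM WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_rate=TARGET_RATE):
    """Return True when the file's rate differs from the rate the model expects."""
    return wav_sample_rate(path) != target_rate
```

When `needs_resampling` returns True, the rate read from the header can be passed as the source frequency to `torchaudio.transforms.Resample` instead of a hard-coded value.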
Features
- Fine-tuned on Marathi using the Open SLR64 dataset.
- Reported to handle male voices well, even though the training data contains only female voices.
- Trained on a Tesla P100 16GB GPU on Google Colab Pro.
Usage Examples
Basic Usage
The model can be used directly, without a language model, as follows, provided your dataset has Marathi actual_text and path_in_folder columns:
import torch, torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# all_data is assumed to hold a 'test' split with path_in_folder and actual_text
# columns (the Advanced Usage script shows one way to build it)
mr_test_dataset = all_data['test']

processor = Wav2Vec2Processor.from_pretrained("sumedh/wav2vec2-large-xlsr-marathi")
model = Wav2Vec2ForCTC.from_pretrained("sumedh/wav2vec2-large-xlsr-marathi")

# Resample from 48 kHz (the rate this script assumes for the source audio) to 16 kHz
resampler = torchaudio.transforms.Resample(48_000, 16_000)

def speech_file_to_array_fn(batch):
    # Load one audio file and store the resampled waveform as a 1-D array
    speech_array, sampling_rate = torchaudio.load(batch["path_in_folder"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

mr_test_dataset = mr_test_dataset.map(speech_file_to_array_fn)
inputs = processor(mr_test_dataset["speech"][:5], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

# Pick the most likely token per frame, then decode the ids to text
predicted_ids = torch.argmax(logits, dim=-1)
print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", mr_test_dataset["actual_text"][:5])
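Taking the per-frame argmax and then decoding, as the script above does, amounts to greedy CTC decoding: consecutive repeats are collapsed, blank tokens are dropped, and the word delimiter becomes a space. The sketch below illustrates that rule in plain Python. The toy vocabulary, the blank id of 0, and the function name `ctc_greedy_decode` are assumptions made for illustration; they are not the model's real vocabulary or the processor's implementation.

```python
def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0, word_delimiter="|"):
    """Collapse per-frame argmax ids into text using the greedy CTC rule."""
    tokens = []
    previous_id = None
    for token_id in frame_ids:
        # Emit a token only when it differs from the previous frame and is not blank
        if token_id != previous_id and token_id != blank_id:
            tokens.append(id_to_token[token_id])
        previous_id = token_id
    # The word delimiter token marks word boundaries in the output
    return "".join(tokens).replace(word_delimiter, " ").strip()


# Hypothetical toy vocabulary, not the model's real one
toy_vocab = {0: "<pad>", 1: "|", 2: "a", 3: "b"}
frames = [2, 2, 0, 2, 3, 3, 1, 1, 0, 3, 0]
print(ctc_greedy_decode(frames, toy_vocab))  # prints "aab b"
```

The blank between the two runs of id 2 is what lets a genuinely doubled character survive the collapsing step.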
Advanced Usage
The model was evaluated on a held-out 10% of the Marathi data in Open SLR64.
import os, re, torch, torchaudio
# Note: load_metric is unavailable in recent releases of datasets;
# the separate evaluate package provides the WER metric there
from datasets import Dataset, load_metric
import pandas as pd
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Build a table of (audio path, transcript) pairs from the dataset index
dataset_path = "./OpenSLR-64_Marathi/mr_in_female/"
audio_df = pd.read_csv(os.path.join(dataset_path,'line_index.tsv'),sep='\t',header=None)
audio_df.columns = ['path_in_folder','actual_text']
audio_df['path_in_folder'] = audio_df['path_in_folder'].apply(lambda x: dataset_path + x + '.wav')

# Shuffle with a fixed seed, then hold out 10% as the test split
audio_df = audio_df.sample(frac=1, random_state=2020).reset_index(drop=True)
all_data = Dataset.from_pandas(audio_df)
all_data = all_data.train_test_split(test_size=0.10,seed=2020)
mr_test_dataset = all_data['test']

wer = load_metric("wer")
processor = Wav2Vec2Processor.from_pretrained("sumedh/wav2vec2-large-xlsr-marathi")
model = Wav2Vec2ForCTC.from_pretrained("sumedh/wav2vec2-large-xlsr-marathi")
model.to("cuda")

# Punctuation stripped from the reference text before scoring
chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"\“]'
resampler = torchaudio.transforms.Resample(48_000, 16_000)

def speech_file_to_array_fn(batch):
    # Normalize the transcript, then load and resample the audio to 16 kHz
    batch["actual_text"] = re.sub(chars_to_ignore_regex, '', batch["actual_text"]).lower()
    speech_array, sampling_rate = torchaudio.load(batch["path_in_folder"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

mr_test_dataset = mr_test_dataset.map(speech_file_to_array_fn)

def evaluate(batch):
    # Run the model on a batch and store the greedy-decoded predictions
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = mr_test_dataset.map(evaluate, batched=True, batch_size=8)
print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["actual_text"])))
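For readers unfamiliar with the metric, word error rate is the word-level edit distance (substitutions, deletions, and insertions) between the prediction and the reference, divided by the number of reference words. The sketch below scores a single sentence pair under that standard definition. The function name `word_error_rate` is hypothetical, and this is not the code that produced the reported figure, which came from the metric call in the script above.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1] / len(ref_words)


# One substituted word out of three reference words
print("WER: {:.2f}".format(100 * word_error_rate("मी घरी जातो", "मी घरी जाते")))  # prints "WER: 33.33"
```

Scoring a whole test set under the same definition would sum the edit distances and the reference word counts across all utterances before dividing.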
Documentation
Training
The train-test split was 90:10.
The Colab training notebook is linked here.
Training Config and Summary
The Weights & Biases run summary is available here.
License
This project is licensed under the Apache 2.0 license.
| Property | Details |
|---|---|
| Model Type | Fine-tuned Wav2Vec2-Large-XLSR-53 for Marathi |
| Training Data | Open SLR64 Marathi dataset |
| Tags | audio, automatic-speech-recognition, speech, xlsr-fine-tuning-week |
| Metrics | WER |
| Base Model | facebook/wav2vec2-large-xlsr-53 |
| Model Name | XLSR Wav2Vec2 Large 53 Marathi by Sumedh Khodke |
| Test WER | 12.7 |