# Wav2Vec2-Large-XLSR-53-Marathi
This model is fine-tuned from facebook/wav2vec2-large-xlsr-53 on Marathi using a part of the InterSpeech 2021 Marathi dataset. It is designed for automatic speech recognition of Marathi; speech input must be sampled at 16kHz.
## Model Information

| Property | Details |
|----------|---------|
| Model Type | XLSR Wav2Vec2 Large 53 Marathi 2 by Gunjan Chhablani |
| Training Datasets | interspeech_2021_asr |
| Evaluation Metrics | wer |
| Tags | audio, automatic-speech-recognition, speech, xlsr-fine-tuning-week |
| License | apache-2.0 |
| Results | Task: Speech Recognition (automatic-speech-recognition); Dataset: InterSpeech 2021 ASR mr (interspeech_2021_asr); Metrics: Test WER = 14.53 |
## Quick Start

This fine-tuned model is based on facebook/wav2vec2-large-xlsr-53 and trained on a part of the InterSpeech 2021 Marathi dataset. When using this model, ensure that your speech input is sampled at 16kHz.
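If your audio is not already at 16kHz, you can resample it before passing it to the processor. A minimal sketch with torchaudio, assuming a hypothetical local file `sample.wav` at an arbitrary sampling rate:

```python
import torchaudio

# "sample.wav" is a placeholder; substitute your own Marathi audio file
speech_array, sampling_rate = torchaudio.load("sample.wav")
if sampling_rate != 16_000:
    # Resample from the file's native rate to the 16kHz the model expects
    resampler = torchaudio.transforms.Resample(sampling_rate, 16_000)
    speech_array = resampler(speech_array)
```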
## Usage Examples

### Basic Usage
```python
import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# test_dataset = ...  # TODO: load a Marathi test dataset with "path" and "sentence" columns

processor = Wav2Vec2Processor.from_pretrained("gchhablani/wav2vec2-large-xlsr-mr-2")
model = Wav2Vec2ForCTC.from_pretrained("gchhablani/wav2vec2-large-xlsr-mr-2")

# Resample the 8kHz source audio to the 16kHz the model expects
resampler = torchaudio.transforms.Resample(8_000, 16_000)

# Preprocessing: read each audio file as a float array resampled to 16kHz
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)
print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
```
## Documentation

### Evaluation

The model can be evaluated on the test set of the Marathi data from InterSpeech 2021 as follows:
```python
import re

import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# test_dataset = ...  # TODO: load the Marathi test split with "path" and "sentence" columns

wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("gchhablani/wav2vec2-large-xlsr-mr-2")
model = Wav2Vec2ForCTC.from_pretrained("gchhablani/wav2vec2-large-xlsr-mr-2")
model.to("cuda")

# Punctuation to strip from the reference transcripts before scoring
chars_to_ignore_regex = '[\,\?\.\!\-\;\:\"\“\'\�]'
resampler = torchaudio.transforms.Resample(8_000, 16_000)

# Preprocessing: normalize the reference text and resample the audio to 16kHz
def speech_file_to_array_fn(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Batched inference: greedy (argmax) CTC decoding of the logits
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"),
                       attention_mask=inputs.attention_mask.to("cuda")).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)
print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
```
Test Result: 19.98% (555 examples from the test set were used for evaluation)

Test Result on 10% of OpenSLR74 data: 64.64%
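For context, WER counts the word-level substitutions, deletions, and insertions needed to turn a prediction into its reference, divided by the number of reference words. A quick sanity check of the metric on made-up strings (not actual model output):

```python
from datasets import load_metric

wer = load_metric("wer")
# One substituted word in a four-word reference -> WER = 0.25
print(wer.compute(predictions=["he is a boy"], references=["she is a boy"]))
```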
### Training

5000 examples from the InterSpeech Marathi dataset were used for training. The Colab notebook used for training can be found here.
## License

This model is licensed under the apache-2.0 license.