Wav2Vec2 Vakyansh Hindi Model
This model is designed for automatic speech recognition in Hindi, offering a solution for transcribing Hindi audio.
Quick Start
Check the spaces demo here.
Features
- Fine-tuned model: fine-tuned on the multilingual pretrained model CLSRIL-23.
- High-quality training: trained on 4200 hours of labelled Hindi data.
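Installation
The original README does not list installation steps. The usage examples import soundfile, torch, torchaudio, transformers, and datasets, so those packages need to be available in your environment.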
Usage Examples
Basic Usage

    import soundfile as sf
    import torch
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor


    def parse_transcription(wav_file):
        # Load the processor and the fine-tuned model
        processor = Wav2Vec2Processor.from_pretrained("Harveenchadha/vakyansh-wav2vec2-hindi-him-4200")
        model = Wav2Vec2ForCTC.from_pretrained("Harveenchadha/vakyansh-wav2vec2-hindi-him-4200")

        # Load the audio; the model expects speech sampled at 16 kHz
        audio_input, sample_rate = sf.read(wav_file)

        # Prepare the waveform and return a PyTorch tensor
        input_values = processor(audio_input, sampling_rate=sample_rate, return_tensors="pt").input_values

        # Inference: take the most likely token for each frame
        with torch.no_grad():
            logits = model(input_values).logits
        predicted_ids = torch.argmax(logits, dim=-1)

        # Decode the predicted token ids to text
        transcription = processor.decode(predicted_ids[0], skip_special_tokens=True)
        print(transcription)

Advanced Usage

    import re

    import torch
    import torchaudio
    # Note: load_metric is deprecated in newer datasets releases in favor of the evaluate library
    from datasets import load_dataset, load_metric
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    test_dataset = load_dataset("common_voice", "hi", split="test")
    wer = load_metric("wer")

    processor = Wav2Vec2Processor.from_pretrained("Harveenchadha/vakyansh-wav2vec2-hindi-him-4200")
    model = Wav2Vec2ForCTC.from_pretrained("Harveenchadha/vakyansh-wav2vec2-hindi-him-4200")
    model.to("cuda")

    # The script assumes 48 kHz source audio and resamples it to the 16 kHz the model expects
    resampler = torchaudio.transforms.Resample(48_000, 16_000)
    # Punctuation removed from the reference sentences before scoring
    chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"\“]'


    # Read each audio file as an array and normalize its reference sentence
    def speech_file_to_array_fn(batch):
        batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
        speech_array, sampling_rate = torchaudio.load(batch["path"])
        batch["speech"] = resampler(speech_array).squeeze().numpy()
        return batch


    test_dataset = test_dataset.map(speech_file_to_array_fn)


    # Run batched inference and decode the argmax predictions
    def evaluate(batch):
        inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
        with torch.no_grad():
            logits = model(inputs.input_values.to("cuda")).logits
        pred_ids = torch.argmax(logits, dim=-1)
        batch["pred_strings"] = processor.batch_decode(pred_ids, skip_special_tokens=True)
        return batch


    result = test_dataset.map(evaluate, batched=True, batch_size=8)
    print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))

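The evaluation script strips punctuation from the reference sentences and lowercases them before scoring. The snippet below is a minimal sketch of that normalization step on its own, using only the standard library. The function name `normalize_sentence` is hypothetical, and the last character in the pattern is assumed to be a left curly quote.

```python
import re

# Punctuation set modelled on the evaluation script; the trailing curly quote
# is an assumption about the intended character.
CHARS_TO_IGNORE = r'[,?.!\-;:"“]'


def normalize_sentence(sentence: str) -> str:
    """Remove the ignored punctuation and lowercase the sentence."""
    return re.sub(CHARS_TO_IGNORE, "", sentence).lower()


print(normalize_sentence("Hello, World!"))  # hello world
print(normalize_sentence("नमस्ते, दुनिया!"))  # नमस्ते दुनिया
```

Lowercasing has no effect on Devanagari text, so for Hindi references only the punctuation removal changes the string.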
Documentation
Pretrained Model
Fine-tuned on the multilingual pretrained model CLSRIL-23. The original fairseq checkpoint is available [here](https://github.com/Open-Speech-EkStep/vakyansh-models). When using this model, make sure that your speech input is sampled at 16 kHz.
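Before transcribing, it can help to confirm that a recording is actually at 16 kHz. The following is a minimal sketch, assuming the input is an uncompressed PCM WAV file that Python's standard `wave` module can read. The helper names `wav_sample_rate` and `needs_resampling` are hypothetical and not part of the model's tooling.

```python
import wave

TARGET_SAMPLE_RATE = 16_000  # sample rate the model expects


def wav_sample_rate(path: str) -> int:
    """Return the sample rate stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path: str, target_rate: int = TARGET_SAMPLE_RATE) -> bool:
    """Return True when the file's sample rate differs from the target rate."""
    return wav_sample_rate(path) != target_rate
```

If `needs_resampling` returns True, the audio should be resampled to 16 kHz first, for example with `torchaudio.transforms.Resample` as in the evaluation script.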
Dataset
This model was trained on 4200 hours of labelled Hindi data. The labelled data is not publicly available as of now.
Training Script
Models were trained using an experimental platform set up by the Vakyansh team at EkStep. Here is the [training repository](https://github.com/Open-Speech-EkStep/vakyansh-wav2vec2-experimentation).
If you want to explore the training logs on wandb, they are [here](https://wandb.ai/harveenchadha/hindi_finetuning_multilingual?workspace=user-harveenchadha).
Colab Demo
You can check the Colab Demo here.
Evaluation
The model can be evaluated on the Hindi test data of Common Voice using the script under Advanced Usage. The test result shows a WER of 33.17%. You can also check the Colab Evaluation.
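For readers unfamiliar with the metric, WER is the word-level edit distance between the reference and the prediction (substitutions, deletions, and insertions) divided by the number of reference words. The snippet below is a minimal pure-Python sketch of that definition. It is not the implementation used by the evaluation script, and the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref_words)


# One substituted word out of four reference words
print(word_error_rate("मेरा नाम राम है", "मेरा नाम श्याम है"))  # 0.25
```

A WER of 33.17% therefore corresponds to roughly one word-level error for every three reference words on the test set.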
Credits
Thanks to the EkStep Foundation for making this possible. The Vakyansh team will be open-sourcing speech models in all the Indic languages.
Technical Details
The model is based on the Wav2Vec2 architecture and is fine-tuned from the multilingual pretrained model CLSRIL-23. It is trained on 4200 hours of labelled Hindi data for Hindi speech recognition.
License
This project is licensed under the MIT license.
Important Note
The results from this model are obtained without a language model, so you may see a higher WER in some cases.
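Decoding without a language model here means greedy CTC decoding: the usage examples take the argmax token for each frame, and the decoder then merges consecutive repeats and drops blank tokens. The snippet below is a toy sketch of that collapse step. The vocabulary, the blank id, and the function name `ctc_greedy_decode` are hypothetical and do not reflect the model's real tokenizer.

```python
from itertools import groupby
from typing import Dict, Sequence


def ctc_greedy_decode(frame_ids: Sequence[int],
                      id_to_token: Dict[int, str],
                      blank_id: int = 0) -> str:
    """Collapse repeated frame predictions, then remove blank tokens."""
    # Merge runs of identical consecutive ids into a single id
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # Drop blanks and map the remaining ids to their tokens
    return "".join(id_to_token[t] for t in collapsed if t != blank_id)


# Hypothetical toy vocabulary; id 0 plays the role of the CTC blank
toy_vocab = {0: "<blank>", 1: "h", 2: "i"}
print(ctc_greedy_decode([1, 1, 0, 2, 2, 0, 0], toy_vocab))  # hi
```

Because each frame is decided independently, nothing corrects acoustically plausible but unlikely word sequences, which is why adding a language model typically lowers WER.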
| Property | Details |
|---|---|
| Model Type | Wav2Vec2 Vakyansh Hindi Model by Harveen Chadha |
| Training Data | 4200 hours of Hindi Labelled Data |
| Metrics | WER (Word Error Rate) |
| Task | Automatic Speech Recognition |
| Dataset | Common Voice hi |
| Test WER | 33.17% |