Vakyansh-wav2vec2-hindi-him-4200 Open Source Model - Efficiently Achieve Hindi Automatic Speech Recognition

Vakyansh Wav2vec2 Hindi Him 4200

Developed by Harveenchadha

Hindi automatic speech recognition model based on Wav2Vec2 architecture, developed by Harveen Chadha, fine-tuned on 4200 hours of annotated Hindi data

Speech Recognition

Transformers

Open Source License: MIT

#Hindi speech recognition #No language model output #Trained on 4200 hours of data

Downloads 2,621

Release Time : 3/2/2022

Model Overview

This model is an automatic speech recognition (ASR) system optimized for Hindi, based on Facebook's Wav2Vec2 architecture, fine-tuned from the CLSRIL-23 multilingual pre-trained model.

Model Features

Large-scale Hindi data training

Fine-tuned on 4200 hours of annotated Hindi data

Multilingual pre-training foundation

Fine-tuned from the CLSRIL-23 multilingual pre-trained model

No language model required

Can be used directly for inference without additional language models

Model Capabilities

Hindi speech recognition

16kHz audio processing

Use Cases

Speech transcription

Hindi speech to text

Convert Hindi speech content into text

Achieves a WER of 33.17% on the Common Voice Hindi test set

🚀 Wav2Vec2 Vakyansh Hindi Model

This model is designed for automatic speech recognition in Hindi, offering a solution for transcribing Hindi audio.

🚀 Quick Start

Check the spaces demo here.

✨ Features

Fine-tuned Model: Fine-tuned on the multilingual pretrained model CLSRIL-23.
High-Quality Training: Trained on 4200 hours of labelled Hindi data.

📦 Installation

No specific installation steps are provided in the original README.

💻 Usage Examples

Basic Usage

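import argparse

import soundfile as sf
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

def parse_transcription(wav_file):
    # load pretrained processor and model
    processor = Wav2Vec2Processor.from_pretrained("Harveenchadha/vakyansh-wav2vec2-hindi-him-4200")
    model = Wav2Vec2ForCTC.from_pretrained("Harveenchadha/vakyansh-wav2vec2-hindi-him-4200")

    # load audio; the model expects mono speech sampled at 16 kHz
    audio_input, sample_rate = sf.read(wav_file)

    # prepare the input values and return a PyTorch tensor
    input_values = processor(audio_input, sampling_rate=sample_rate, return_tensors="pt").input_values

    # INFERENCE
    # retrieve logits without tracking gradients, then take the argmax per frame
    with torch.no_grad():
        logits = model(input_values).logits
    predicted_ids = torch.argmax(logits, dim=-1)

    # transcribe: decode the predicted token ids into text
    transcription = processor.decode(predicted_ids[0], skip_special_tokens=True)
    print(transcription)
    return transcription

if __name__ == "__main__":
    # command-line entry point: pass the path of the WAV file to transcribe
    parser = argparse.ArgumentParser(description="Transcribe a Hindi WAV file")
    parser.add_argument("wav_file", help="path to a 16 kHz WAV file")
    args = parser.parse_args()
    parse_transcription(args.wav_file)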
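
The `torch.argmax` and `processor.decode` steps above amount to greedy CTC decoding: take the most likely token for each audio frame, merge consecutive repeats, and drop the blank token. The sketch below illustrates that idea in plain Python. The toy vocabulary, the blank id, and the frame ids are all hypothetical and are not the model's actual vocabulary; the `|` word delimiter is assumed to follow the usual Wav2Vec2 tokenizer convention.

```python
from itertools import groupby

# Hypothetical toy vocabulary; the real model ships its own vocabulary file.
BLANK_ID = 0
VOCAB = {0: "<pad>", 1: "|", 2: "न", 3: "म", 4: "स", 5: "\u094d", 6: "त", 7: "\u0947"}


def ctc_greedy_decode(frame_ids, vocab=VOCAB, blank_id=BLANK_ID, word_delimiter="|"):
    """Collapse repeated frame predictions, drop blanks, and join the tokens."""
    # 1. Merge runs of identical ids (one token can span many audio frames).
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # 2. Remove the CTC blank, which only serves to separate genuine repeats.
    tokens = [vocab[i] for i in collapsed if i != blank_id]
    # 3. Turn the word-delimiter token into a space.
    return "".join(tokens).replace(word_delimiter, " ").strip()


# Made-up per-frame argmax ids for a short utterance.
frame_ids = [0, 2, 2, 0, 3, 3, 4, 5, 5, 6, 0, 7, 7, 1, 0]
print(ctc_greedy_decode(frame_ids))
```

Because a blank between two identical ids keeps them distinct, a sequence such as `[3, 3, 0, 3]` decodes to two copies of token 3 rather than one.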

Advanced Usage

import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import re

test_dataset = load_dataset("common_voice", "hi", split="test")
wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("Harveenchadha/vakyansh-wav2vec2-hindi-him-4200")
model = Wav2Vec2ForCTC.from_pretrained("Harveenchadha/vakyansh-wav2vec2-hindi-him-4200")
model.to("cuda")

resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Raw string so the backslashes reach the regex engine unchanged
chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"\“]'

# Preprocessing the datasets.
# We need to read the audio files as arrays and resample them from 48 kHz to 16 kHz
def speech_file_to_array_fn(batch):
  batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
  speech_array, sampling_rate = torchaudio.load(batch["path"])
  batch["speech"] = resampler(speech_array).squeeze().numpy()
  return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Run batched inference on the preprocessed audio and decode the predictions.
def evaluate(batch):
  inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

  with torch.no_grad():
      logits = model(inputs.input_values.to("cuda")).logits

  pred_ids = torch.argmax(logits, dim=-1)
  batch["pred_strings"] = processor.batch_decode(pred_ids, skip_special_tokens=True)
  return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
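
The reference-cleaning step inside `speech_file_to_array_fn` can be tried in isolation. The sketch below applies the same character class to made-up sentences (not taken from Common Voice); the function name `normalize_reference` is introduced here for illustration only.

```python
import re

# Same punctuation class as the evaluation script, written as a raw string.
CHARS_TO_IGNORE = r'[\,\?\.\!\-\;\:\"\“]'


def normalize_reference(sentence):
    """Strip the ignored punctuation and lowercase, as the script does before scoring."""
    return re.sub(CHARS_TO_IGNORE, "", sentence).lower()


# Made-up example sentences.
print(normalize_reference("क्या आप ठीक हैं?"))
print(normalize_reference("Hello, World!"))
```

Two details are worth noticing: `.lower()` only affects cased scripts such as Latin, and the character class does not list the danda (।), so that mark stays in the reference text.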

📚 Documentation

Pretrained Model

Fine-tuned on the multilingual pretrained model CLSRIL-23. The original fairseq checkpoint is available [here](https://github.com/Open-Speech-EkStep/vakyansh-models). When using this model, make sure that your speech input is sampled at 16 kHz.
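
Audio recorded at another rate has to be resampled before it meets the 16 kHz requirement. The sketch below shows the idea with plain linear interpolation on a mono list of samples. It is illustrative only: it applies no anti-aliasing filter, and the helper name `resample_linear` is hypothetical. In practice a dedicated resampler such as `torchaudio.transforms.Resample`, used in the Advanced Usage script, is the more appropriate choice.

```python
TARGET_RATE = 16_000  # sampling rate the model expects


def resample_linear(samples, source_rate, target_rate=TARGET_RATE):
    """Resample a mono signal by linear interpolation (illustrative only)."""
    if source_rate == target_rate or len(samples) < 2:
        return list(samples)
    n_out = int(len(samples) * target_rate / source_rate)
    step = source_rate / target_rate  # input samples advanced per output sample
    out = []
    for i in range(n_out):
        pos = i * step
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        # Weighted average of the two neighbouring input samples.
        out.append(samples[left] * (1.0 - frac) + samples[right] * frac)
    return out


# One second of a 48 kHz ramp becomes one second at 16 kHz.
ramp_48k = [i / 48_000 for i in range(48_000)]
ramp_16k = resample_linear(ramp_48k, 48_000)
print(len(ramp_16k))  # 16000
```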

Dataset

This model was trained on 4200 hours of labelled Hindi data. The labelled data is not publicly available as of now.

Training Script

Models were trained using the experimental platform set up by the Vakyansh team at Ekstep. Here is the [training repository](https://github.com/Open-Speech-EkStep/vakyansh-wav2vec2-experimentation).

If you want to explore the training logs on wandb, they are [here](https://wandb.ai/harveenchadha/hindi_finetuning_multilingual?workspace=user-harveenchadha).

Colab Demo

You can check the Colab Demo here.

Evaluation

The model can be evaluated on the Hindi test data of Common Voice, as in the Advanced Usage script. The test result shows a WER of 33.17%. You can also check the Colab Evaluation.
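
For readers unfamiliar with the metric, WER is the word-level edit distance between hypothesis and reference divided by the number of reference words. The sketch below computes it for a single sentence pair in plain Python. It is an illustration of the definition, not the implementation behind the `wer` metric used in the evaluation script, which aggregates over the whole test set; the example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the processed ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Made-up pair: the hypothesis drops the last of five reference words.
reference = "मैं घर जा रहा हूँ"
hypothesis = "मैं घर जा रहा"
print("WER: {:.2f}".format(100 * word_error_rate(reference, hypothesis)))
```

Since insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.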

Credits

Thanks to Ekstep Foundation for making this possible. The Vakyansh team will be open-sourcing speech models in all the Indic languages.

🔧 Technical Details

The model is based on the Wav2Vec2 architecture and fine-tuned from the multilingual pretrained model CLSRIL-23 on 4200 hours of labelled Hindi data.

📄 License

This project is licensed under the MIT license.

⚠️ Important Note

The reported results are obtained without a language model, so you may see a higher WER in some cases.

Property	Details
Model Type	Wav2Vec2 Vakyansh Hindi Model by Harveen Chadha
Training Data	4200 hours of Hindi Labelled Data
Metrics	WER (Word Error Rate)
Task	Automatic Speech Recognition
Dataset	Common Voice (Hindi, hi)
Test WER	33.17%
