Wave2vec2-large-xlsr-Hindi Open-source Hindi Speech Recognition Model

Developed by shiwangi27

A Hindi speech recognition model fine-tuned from facebook/wav2vec2-large-xlsr-53 on OpenSLR Hindi data and evaluated on the Common Voice Hindi test set. It expects audio input sampled at 16 kHz.

Speech Recognition

Transformers

License: Apache-2.0 (open source)

Tags: Hindi Speech Recognition, XLSR Fine-tuning, Low-resource Optimization

Release Time: 3/2/2022

Model Overview

This model is built on the Wav2Vec2 architecture and fine-tuned for Hindi speech recognition, that is, for converting Hindi speech to text.

Model Features

Multi-dataset Training

Fine-tuned on OpenSLR Hindi data (train and test sets combined to increase variation) and evaluated on the Common Voice Hindi test set.

Sampling Rate Adaptation

Expects audio input sampled at 16 kHz. The 8 kHz OpenSLR data was upsampled to 16 kHz for training.

No Language Model Required

Can be used directly without additional language model support.

Model Capabilities

Hindi Speech Recognition

Speech-to-Text

Automatic Speech Transcription

Use Cases

Speech Transcription

Hindi Speech Transcription

Convert Hindi speech content into text format

Achieves a WER of 46.055% on the Common Voice test set.

Voice Assistants

Hindi Voice Command Recognition

Used as a speech recognition module for Hindi voice assistants or voice control systems

🚀 Wav2Vec2-Large-XLSR-Hindi

This model is a fine-tuned version of facebook/wav2vec2-large-xlsr-53 for Hindi. It was trained on the OpenSLR Hindi dataset and evaluated on the Common Voice Hindi test set, with the aim of providing accurate speech recognition for Hindi.

🚀 Quick Start

The OpenSLR Hindi data used for training was a random sample of size 10,000. The OpenSLR train and test sets were combined into a single training set to increase variation, and evaluation was conducted on the Common Voice Hindi test set. Because the OpenSLR audio is sampled at 8 kHz, it was upsampled to 16 kHz for training.

When using this model, ensure that your speech input is sampled at 16kHz.
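One way to check this requirement before transcription is to read the sample rate from the file header. The sketch below uses only Python's standard `wave` module and assumes uncompressed PCM WAV input. The names `wav_sample_rate`, `needs_resampling`, `make_silent_wav`, and `EXPECTED_RATE` are illustrative and not part of the model's API.

```python
import io
import wave

EXPECTED_RATE = 16_000  # sample rate the model expects, per the note above


def wav_sample_rate(source):
    """Return the sample rate in Hz of a PCM WAV file (path or file-like object)."""
    with wave.open(source, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(source, expected_rate=EXPECTED_RATE):
    """Return True if the audio must be resampled before it is passed to the model."""
    return wav_sample_rate(source) != expected_rate


def make_silent_wav(rate, seconds=1):
    """Build an in-memory mono 16-bit WAV of silence (demo data only)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 16-bit samples
        wav_file.setframerate(rate)
        wav_file.writeframes(b"\x00\x00" * rate * seconds)
    buffer.seek(0)  # rewind so the buffer can be read back
    return buffer


if __name__ == "__main__":
    # An 8 kHz file (like the raw OpenSLR audio) versus a 16 kHz file.
    print(needs_resampling(make_silent_wav(8_000)))
    print(needs_resampling(make_silent_wav(16_000)))
```

If `needs_resampling` returns True, the audio can be resampled with `torchaudio.transforms.Resample(orig_freq, 16_000)`, passing the file's actual rate as `orig_freq`.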

⚠️ Important Note

This is the first iteration of the fine-tuning. The model will be updated if the WER improves in future experiments.

✨ Features

Fine-Tuned on Hindi: Specifically optimized for the Hindi language.
Combined Datasets: The OpenSLR Hindi train and test sets were combined for training; the Common Voice Hindi test set was used for evaluation.
Upsampling: The 8 kHz OpenSLR data was upsampled to 16 kHz for training.

📦 Installation

The original model card gives no installation steps. The usage examples below import torch, torchaudio, datasets, and transformers, so these packages must be installed.

💻 Usage Examples

Basic Usage

The model can be used directly (without a language model) as follows:

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "hi", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("shiwangi27/wave2vec2-large-xlsr-hindi") 
model = Wav2Vec2ForCTC.from_pretrained("shiwangi27/wave2vec2-large-xlsr-hindi")

resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
	speech_array, sampling_rate = torchaudio.load(batch["path"])
	batch["speech"] = resampler(speech_array).squeeze().numpy()
	return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset[:2]["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
	logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset[:2]["sentence"])
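The example above decodes without a language model: it takes the most likely token per audio frame (`torch.argmax`) and lets `processor.batch_decode` turn those ids into text. Conceptually this is greedy CTC decoding, which collapses repeated frame predictions and drops blank tokens. The sketch below illustrates that idea in plain Python. The vocabulary and ids are hypothetical toy values, not the model's real vocabulary; the only conventions assumed are that the padding token serves as the CTC blank and `|` marks word boundaries, as in Wav2Vec2 tokenizers.

```python
from itertools import groupby

# Hypothetical toy vocabulary for illustration only; the real model ships its own.
VOCAB = {0: "<pad>", 1: "|", 2: "क", 3: "ल", 4: "अ", 5: "ब"}
BLANK_ID = 0  # the padding token doubles as the CTC blank


def ctc_greedy_decode(frame_ids, vocab=VOCAB, blank_id=BLANK_ID):
    """Collapse repeated frame predictions, drop blanks, and map ids to text."""
    # Step 1: merge runs of identical consecutive ids into a single id.
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # Step 2: remove blanks, which only separate genuine repeats.
    tokens = [vocab[token_id] for token_id in collapsed if token_id != blank_id]
    # Step 3: join characters and turn word delimiters into spaces.
    return "".join(tokens).replace("|", " ").strip()
```

For instance, the frame sequence `[2, 2, 0, 3, 3, 1, 1, 4, 0, 5, 5]` collapses to the two words "कल अब". A blank between two identical ids, as in `[2, 0, 2]`, keeps both characters.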

Advanced Usage

The model can be evaluated as follows on the Hindi test data of Common Voice.

import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import re

test_dataset = load_dataset("common_voice", "hi", split="test") 
wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("shiwangi27/wave2vec2-large-xlsr-hindi") 
model = Wav2Vec2ForCTC.from_pretrained("shiwangi27/wave2vec2-large-xlsr-hindi") 
model.to("cuda")

chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"\“\%\�\।\']'  # raw string avoids invalid-escape warnings
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
	batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
	speech_array, sampling_rate = torchaudio.load(batch["path"])
	batch["speech"] = resampler(speech_array).squeeze().numpy()
	return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

def evaluate(batch):
	inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

	with torch.no_grad():
		logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits

	pred_ids = torch.argmax(logits, dim=-1)
	batch["pred_strings"] = processor.batch_decode(pred_ids)
	return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
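The evaluation script strips punctuation from the reference sentences before scoring, so punctuation differences do not count as word errors. The sketch below shows the effect of that step in isolation. It assumes a subset of the characters ignored by the script (common punctuation plus the Devanagari danda `।`), and the function name `normalize_sentence` is illustrative.

```python
import re

# Subset of the characters ignored by the evaluation script (an assumption
# for this sketch): common punctuation plus the Devanagari danda.
CHARS_TO_IGNORE = r'[,?.!\-;:"“%।\']'


def normalize_sentence(sentence):
    """Remove ignored punctuation and lowercase any Latin letters."""
    return re.sub(CHARS_TO_IGNORE, "", sentence).lower()
```

Lowercasing has no effect on Devanagari text, which has no letter case; it only matters for Latin-script words that appear in a transcript.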

📚 Documentation

Test Results

Dataset	WER
Test split Common Voice Hindi	46.055%
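The WER in the table is the word error rate: the number of word-level substitutions, insertions, and deletions needed to turn the predictions into the references, divided by the number of reference words. The sketch below is a minimal pure-Python version of that definition, pooled over all sentences. It is meant to illustrate the metric, not to reproduce the exact implementation behind `load_metric("wer")`; the function names are illustrative.

```python
def edit_distance(reference_words, hypothesis_words):
    """Word-level Levenshtein distance between two word lists."""
    previous = list(range(len(hypothesis_words) + 1))
    for i, ref_word in enumerate(reference_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion of a reference word
                current[j - 1] + 1,      # insertion of an extra word
                previous[j - 1] + cost,  # substitution or exact match
            ))
        previous = current
    return previous[-1]


def word_error_rate(references, predictions):
    """Total word edits divided by total reference words, over all sentences."""
    total_edits = 0
    total_words = 0
    for reference, prediction in zip(references, predictions):
        ref_words = reference.split()
        total_edits += edit_distance(ref_words, prediction.split())
        total_words += len(ref_words)
    return total_edits / total_words
```

Under this definition, a WER of 46.055% means roughly 46 word edits per 100 reference words on the Common Voice Hindi test split.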

Code

The notebook used to train this model can be found at shiwangi27/googlecolab. A modified version of run_common_voice.py was used for training.

📄 License

This project is licensed under the Apache-2.0 license.

Property	Details
Model Type	Fine-tuned Hindi XLSR Wav2Vec2 Large
Training Data	OpenSLR Hindi, Common Voice
Metrics	WER
Tags	audio, automatic-speech-recognition, speech, xlsr-fine-tuning-week, xlsr-hindi
License	apache-2.0
