Wav2Vec2-Large-XLSR-Bengali: Open-Source Model for Automatic Bengali Speech Recognition

Wav2vec2 Large Xlsr Bengali

Developed by arijitx

A Bengali automatic speech recognition model fine-tuned from facebook/wav2vec2-large-xlsr-53 on 40,000 speech samples from the OpenSLR dataset.

Speech Recognition, Other #Bengali Speech Recognition #Low-resource Language ASR #XLSR-53 Fine-tuning

Downloads 758

Release Time: 3/2/2022

Model Overview

This is a model specifically designed for Bengali automatic speech recognition (ASR), capable of converting Bengali speech into text.

Model Features

High Accuracy Bengali Recognition

A speech recognition model fine-tuned specifically for Bengali, achieving a word error rate (WER) of 32.45% on roughly 4,200 held-out test utterances

Based on XLSR Architecture

Fine-tuned from the facebook/wav2vec2-large-xlsr-53 model, which was pre-trained with cross-lingual speech representation learning (XLSR)

Large-scale Training Data

Trained with approximately 40,000 Bengali speech samples from the OpenSLR dataset

Model Capabilities

Bengali Speech Recognition

Audio to Text Conversion

16kHz Sampling Rate Audio Processing

Use Cases

Speech Transcription

Bengali Speech Transcription

Convert Bengali speech content into text format

Word error rate 32.45%

Voice Assistants

Bengali Voice Interaction

Provide speech recognition capabilities for Bengali voice assistants

🚀 Wav2Vec2-Large-XLSR-Bengali

This model is a fine-tuned version of facebook/wav2vec2-large-xlsr-53 for Bengali. It was trained on a subset of 40,000 utterances drawn from the Bengali ASR training dataset, which contains ~196K utterances. The Word Error Rate (WER) is measured on ~4200 utterances held out from training.
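For readers unfamiliar with the metric, WER is conventionally the word-level edit distance (substitutions, insertions, and deletions) between the recognized text and the reference transcript, divided by the number of reference words. The following is a minimal sketch of that definition for a single utterance. The function name `word_error_rate` is illustrative, and the source does not say which tool or aggregation method produced the reported figure, so this should not be read as the author's evaluation script.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of three reference words.
example_wer = word_error_rate("আমি ভাত খাই", "আমি রুটি খাই")
```

A corpus-level score is usually obtained by summing the edit distances over all utterances and dividing by the total number of reference words, rather than by averaging per-utterance values.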

🚀 Quick Start

When using this model, make sure that your speech input is sampled at 16 kHz. The training script is available as train.py.

Data Preparation Notebook: Link
Inference Notebook: Link

✨ Features

Language: Bengali
Datasets: OpenSLR
Metrics: Word Error Rate (WER)
Tags: bn, audio, automatic-speech-recognition, speech
License: CC BY-SA 4.0

📦 Installation

No installation steps are documented for this model. The usage example below depends on the torch, torchaudio, and transformers Python packages.

💻 Usage Examples

Basic Usage

The model can be used directly (without a language model) as follows:

import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load the processor (feature extractor + tokenizer) and the fine-tuned model.
processor = Wav2Vec2Processor.from_pretrained("arijitx/wav2vec2-large-xlsr-bengali")
model = Wav2Vec2ForCTC.from_pretrained("arijitx/wav2vec2-large-xlsr-bengali")
# model = model.to("cuda")

def speech_file_to_array_fn(path):
    # Read the file and resample to 16 kHz if it was recorded at another rate.
    # Assumes a mono recording.
    speech_array, sampling_rate = torchaudio.load(path)
    if sampling_rate != 16_000:
        speech_array = torchaudio.functional.resample(speech_array, sampling_rate, 16_000)
    return speech_array.squeeze().numpy()

speech_array = speech_file_to_array_fn("test_file.wav")
inputs = processor(speech_array, sampling_rate=16_000, return_tensors="pt", padding=True)

# Run inference without tracking gradients.
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy decoding: pick the most likely token for every frame.
predicted_ids = torch.argmax(logits, dim=-1)
preds = processor.batch_decode(predicted_ids)[0]
print(preds.replace("[PAD]", ""))
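The argmax-then-decode step above is greedy CTC decoding: the model emits one token per audio frame, consecutive repeats are merged, and the blank token is dropped. The sketch below illustrates that collapse rule in plain Python. The function name, the toy vocabulary, and the choice of ID 0 as the blank are all assumptions made for illustration; the model's real vocabulary and blank ID are defined by its tokenizer.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Merge consecutive duplicate frame predictions, then drop the blank token."""
    collapsed = [token for token, _ in groupby(frame_ids)]
    return [token for token in collapsed if token != blank_id]


# Hypothetical toy vocabulary, not the model's actual one.
vocab = {0: "[PAD]", 1: "ক", 2: "থ", 3: "া"}

# One predicted token ID per audio frame, as produced by argmax over the logits.
frames = [0, 1, 1, 0, 2, 2, 2, 3, 0, 0]
decoded = "".join(vocab[i] for i in ctc_greedy_collapse(frames))
```

Because repeats are merged before blanks are removed, a blank between two identical tokens is what lets the same character appear twice in a row in the output.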

Test Result: WER on ~4200 utterances: 32.45%

📚 Documentation

The model is fine-tuned on a subset of Bengali ASR training data. The test WER is calculated using held-out data. When using the model, ensure the input speech is sampled at 16 kHz.
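One way to check the sampling-rate requirement before running inference is to read the rate from the audio file's header. The sketch below uses Python's standard-library `wave` module, which only handles uncompressed PCM WAV files; the helper names and the `EXPECTED_RATE` constant are illustrative and not part of the model's tooling.

```python
import wave

EXPECTED_RATE = 16_000  # sampling rate the model expects, in Hz


def wav_sample_rate(path: str) -> int:
    """Return the sampling rate stored in a PCM WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path: str, expected_rate: int = EXPECTED_RATE) -> bool:
    """True when the file's sampling rate differs from the expected rate."""
    return wav_sample_rate(path) != expected_rate
```

If `needs_resampling("test_file.wav")` returns True, the audio should be resampled to 16 kHz before it is passed to the processor.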

🔧 Technical Details

The model is a fine-tuned version of facebook/wav2vec2-large-xlsr-53 for Bengali. It uses a subset of 40,000 utterances from the Bengali ASR training dataset. The test WER is 32.45% on ~4200 held-out utterances.

📄 License

This model is released under the CC BY-SA 4.0 license.

Model Index

Property	Details
Model Name	XLSR Wav2Vec2 Bengali by Arijit
Task	Speech Recognition (automatic-speech-recognition)
Dataset	OpenSLR (ben)
Metrics	Test WER: 32.45%
