# arijitx/wav2vec2-xls-r-300m-bengali
This model is a fine-tuned version of facebook/wav2vec2-xls-r-300m for Bengali automatic speech recognition, achieving strong results on the OpenSLR SLR53 dataset.
## Quick Start
This model is a fine-tuned version of facebook/wav2vec2-xls-r-300m on the OpenSLR SLR53 Bengali dataset. It achieves the following results on the evaluation set:
**Without language model**
- WER: 0.21726385291857586
- CER: 0.04725010353701041
**With 5-gram language model**
The 5-gram language model was trained on 30M sentences randomly sampled from the AI4Bharat IndicCorp dataset.
- WER: 0.15322879016421437
- CER: 0.03413696666806267
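For reference, here is a minimal inference sketch; it is not part of the original card. It assumes transformers and torchaudio are installed and uses a placeholder audio path (`sample_bn.wav`), with 16 kHz mono input expected by the model:

```python
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("arijitx/wav2vec2-xls-r-300m-bengali")
model = Wav2Vec2ForCTC.from_pretrained("arijitx/wav2vec2-xls-r-300m-bengali")

speech, sr = torchaudio.load("sample_bn.wav")  # placeholder clip
if sr != 16_000:                               # the model expects 16 kHz audio
    speech = torchaudio.functional.resample(speech, sr, 16_000)

inputs = processor(speech.squeeze().numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids)[0])     # greedy (no-LM) transcription
```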
## ⚠️ Important Note
The dataset contains 10,935 samples in total; training used the first 95% and evaluation the last 5%, so no evaluation example was part of the training set. Training was stopped after 180k steps. Output predictions are available in the Files section.
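A minimal sketch of that split (assumed code, not from the card), using the Hub's openslr dataset with the SLR53 config:

```python
from datasets import load_dataset

ds = load_dataset("openslr", "SLR53", split="train")  # 10,935 examples in total
n_train = int(0.95 * len(ds))                         # first 95% -> training
train_ds = ds.select(range(n_train))
eval_ds = ds.select(range(n_train, len(ds)))          # last 5% (~547 examples) -> evaluation
```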
## Documentation
### Model Information

| Property | Details |
|----------|---------|
| Model Type | Automatic Speech Recognition |
| Supported Languages | Bengali (bn) |
| Tags | automatic-speech-recognition, bn, hf-asr-leaderboard, openslr_SLR53, robust-speech-event |
| Datasets | openslr, SLR53, AI4Bharat/IndicCorp |
| Metrics | WER, CER |
Model Performance
Task |
Dataset |
WER (Without LM) |
CER (Without LM) |
WER (With LM) |
CER (With LM) |
Speech Recognition |
Open SLR (SLR53) |
0.21726385291857586 |
0.04725010353701041 |
0.15322879016421437 |
0.03413696666806267 |
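For the with-LM numbers, a hedged decoding sketch with pyctcdecode follows. This is not the card's evaluation code, and the kenlm binary path `bn_5gram.bin` is a placeholder: the card does not state how the 30M-sentence IndicCorp LM is distributed.

```python
import torch
import torchaudio
from pyctcdecode import build_ctcdecoder
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("arijitx/wav2vec2-xls-r-300m-bengali")
model = Wav2Vec2ForCTC.from_pretrained("arijitx/wav2vec2-xls-r-300m-bengali")

# Decoder labels in vocab-id order; "|" is wav2vec2's word delimiter, which
# pyctcdecode expects as a space. Pad-style tokens are treated as the CTC blank.
vocab = sorted(processor.tokenizer.get_vocab().items(), key=lambda kv: kv[1])
labels = [tok if tok != "|" else " " for tok, _ in vocab]
decoder = build_ctcdecoder(labels, kenlm_model_path="bn_5gram.bin")  # placeholder LM path

speech, sr = torchaudio.load("sample_bn.wav")  # placeholder clip, 16 kHz mono assumed
inputs = processor(speech.squeeze().numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits
print(decoder.decode(logits[0].numpy()))       # LM-boosted transcription
```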
### Training Hyperparameters
The following hyperparameters were used during training:
- dataset_name="openslr"
- model_name_or_path="facebook/wav2vec2-xls-r-300m"
- dataset_config_name="SLR53"
- output_dir="./wav2vec2-xls-r-300m-bengali"
- overwrite_output_dir
- num_train_epochs="50"
- per_device_train_batch_size="32"
- per_device_eval_batch_size="32"
- gradient_accumulation_steps="1"
- learning_rate="7.5e-5"
- warmup_steps="2000"
- length_column_name="input_length"
- evaluation_strategy="steps"
- text_column_name="sentence"
- chars_to_ignore , ? . ! - ; : " “ % ‘ ” � — ’ … –
- save_steps="2000"
- eval_steps="3000"
- logging_steps="100"
- layerdrop="0.0"
- activation_dropout="0.1"
- save_total_limit="3"
- freeze_feature_encoder
- feat_proj_dropout="0.0"
- mask_time_prob="0.75"
- mask_time_length="10"
- mask_feature_prob="0.25"
- mask_feature_length="64"
- preprocessing_num_workers="32"
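As a rough guide (not code from the card), the model-side settings above correspond to these transformers config overrides; the remaining flags are arguments to the training script referenced in the notes below:

```python
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-xls-r-300m",
    layerdrop=0.0,
    activation_dropout=0.1,
    feat_proj_dropout=0.0,
    mask_time_prob=0.75,     # SpecAugment-style masking along the time axis
    mask_time_length=10,
    mask_feature_prob=0.25,  # masking along the feature axis
    mask_feature_length=64,
)
model.freeze_feature_encoder()  # mirrors the freeze_feature_encoder flag
```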
### Framework Versions
- Transformers 4.16.0.dev0
- Pytorch 1.10.1+cu102
- Datasets 1.17.1.dev0
- Tokenizers 0.11.0
### Notes
- Training and evaluation code was adapted from: https://github.com/huggingface/transformers/tree/master/examples/research_projects/robust-speech-event.
- Bengali speech data was not available in the Common Voice or multilingual LibriSpeech datasets, so OpenSLR SLR53 was used.
- A minimum audio duration of 0.5 s was used to filter the training data, which excluded roughly 10-20 samples (see the sketch after this list).
- OpenSLR SLR53 transcripts are not part of the training data for the LM used in evaluation.
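The duration filter mentioned above might look like the following sketch; the column names follow the Hub's openslr schema, and the exact filtering code is not in the card:

```python
from datasets import load_dataset

MIN_SECONDS = 0.5  # clips shorter than this were dropped from training

ds = load_dataset("openslr", "SLR53", split="train")

def long_enough(example):
    audio = example["audio"]  # decoded to {"array", "sampling_rate", ...} by datasets
    return len(audio["array"]) / audio["sampling_rate"] >= MIN_SECONDS

ds = ds.filter(long_enough)
```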
## License
This model is licensed under the Apache 2.0 license.