Wav2vec2-bn-300m Open-source Model - Free Automatic Speech Recognition for Bengali

Wav2vec2 Bn 300m

Developed by Tahsin-Mayeesha

A fine-tuned Bengali automatic speech recognition model based on facebook/wav2vec2-xls-r-300m, trained using the OPENSLR_SLR53 dataset

Speech Recognition

Transformers

License: Apache-2.0
Tags: Bengali Speech Recognition, Low CER, 5-gram Language Model Optimization

Downloads 25

Release Time: 3/2/2022

Model Overview

This is an automatic speech recognition (ASR) model for Bengali, fine-tuned from the facebook/wav2vec2-xls-r-300m checkpoint on the OPENSLR_SLR53 dataset. With a 5-gram language model it reaches a WER of 17.78% and a CER of 4.39% on a held-out evaluation split.

Model Features

High Accuracy Bengali Recognition

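With the 5-gram language model, achieves a word error rate (WER) of 17.78% and a character error rate (CER) of 4.39% on the held-out OpenSLR evaluation split; without the language model, WER is 31.10% and CER is 7.2%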

Supports Language Model Integration

Can be combined with a 5-gram language model to further improve recognition accuracy

Large-scale Training Data

Uses 218,703 samples from the OPENSLR_SLR53 dataset, of which 10% (21,871 samples) are held out for evaluation

Model Capabilities

Bengali Speech Recognition

Speech-to-Text

Supports Language Model Enhancement

Use Cases

Speech Transcription

Bengali Speech Transcription

Convert Bengali speech content into text

Achieved a WER of 0.17776 (with the 5-gram language model) on the held-out evaluation set

Voice Assistants

Bengali Voice Interaction

Provides speech recognition capabilities for Bengali voice assistants

🚀 Tahsin-Mayeesha/wav2vec2-bn-300m

This model is a fine-tuned version of facebook/wav2vec2-xls-r-300m on a Bengali dataset, aiming to provide high-quality automatic speech recognition.

🚀 Quick Start

This model is a fine-tuned version of facebook/wav2vec2-xls-r-300m on the OPENSLR_SLR53 Bengali dataset. Its results on the evaluation set are listed under Evaluation Results.
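A minimal inference sketch, assuming the `transformers` library and ffmpeg are installed and that the checkpoint loads through the standard automatic-speech-recognition pipeline. The `transcribe` helper and the file name `sample.wav` are hypothetical; the model card itself does not include usage code. Whether the 5-gram language model is applied at decode time depends on the files in the repository and on optional packages such as `pyctcdecode` being present, which is not verified here.

```python
MODEL_ID = "Tahsin-Mayeesha/wav2vec2-bn-300m"


def transcribe(audio_path, model_id=MODEL_ID):
    """Return the transcription of one audio file (hypothetical helper)."""
    # Imported lazily so this module can be loaded without transformers installed.
    from transformers import pipeline

    # Downloads the checkpoint from the Hugging Face Hub on first use.
    asr = pipeline("automatic-speech-recognition", model=model_id)
    # For a file path, the pipeline decodes the audio with ffmpeg and
    # returns a dict whose "text" entry holds the transcription.
    return asr(audio_path)["text"]


# Example call (hypothetical path to a Bengali recording):
# text = transcribe("sample.wav")
```

Building the pipeline once and reusing it across files avoids reloading the roughly 300M-parameter checkpoint for every call.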

✨ Features

Evaluation Results

Without language model:
- Wer: 0.3110
- Cer: 0.072
With 5-gram language model trained on the indic-text dataset:
- Wer: 0.17776
- Cer: 0.04394
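For readers unfamiliar with the two metrics: WER and CER are edit distances (substitutions, insertions, deletions) divided by the reference length, counted over words and characters respectively. The sketch below is a plain-Python illustration, not the evaluation script used for this model; the sentences are made up, and the functions score a single utterance, whereas corpus-level scores are normally computed from error and length totals over all utterances. The last two lines compute the relative error reduction from the language model using the figures listed above.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate for one utterance."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate for one utterance (counted over code points)."""
    return edit_distance(reference, hypothesis) / len(reference)


def relative_reduction(before, after):
    """Fraction of the original error removed."""
    return (before - after) / before


# Made-up sentences: one of three words differs.
print(wer("আমি ভাত খাই", "আমি ভাত খাব"))
# Made-up strings: one of four characters differs.
print(cer("abcd", "abed"))  # 0.25

# Reported figures: without LM vs. with the 5-gram LM.
print(round(relative_reduction(0.3110, 0.17776), 3))  # 0.428 (WER)
print(round(relative_reduction(0.072, 0.04394), 3))   # 0.39 (CER)
```

On these figures, the language model removes roughly 43% of word errors and 39% of character errors relative to decoding without it.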

Note

10% of the 218,703 total samples were held out for evaluation, giving an evaluation set of 21,871 examples. Training was stopped after 30k steps. Output predictions are available under the files section.
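The split sizes can be checked with a few lines of arithmetic. This sketch assumes the evaluation share is rounded up, which reproduces the reported 21,871; the size of the training portion is derived here and is not stated in the model card.

```python
import math

total_samples = 218_703
eval_fraction = 0.10

# Assumption: the evaluation share is rounded up to a whole sample.
eval_samples = math.ceil(total_samples * eval_fraction)
train_samples = total_samples - eval_samples

print(eval_samples)   # 21871
print(train_samples)  # 196832
```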

📚 Documentation

Training Hyperparameters

The following hyperparameters were used during training:

learning_rate: 7.5e-05
train_batch_size: 16
eval_batch_size: 16
gradient_accumulation_steps: 4
optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 2000
mixed_precision_training: Native AMP
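To make the schedule concrete, the sketch below shows how a linear scheduler with warmup shapes the learning rate and what batch size each optimizer update sees. It is an illustration in plain Python, not the training script. Two assumptions are labeled in the code: training on a single device, and `HYPOTHETICAL_TOTAL_STEPS`, a made-up value, since the card only says training was stopped after 30k steps and does not give the planned total.

```python
LEARNING_RATE = 7.5e-05
WARMUP_STEPS = 2000
TRAIN_BATCH_SIZE = 16
GRAD_ACCUM_STEPS = 4


def linear_lr(step, total_steps, peak_lr=LEARNING_RATE, warmup_steps=WARMUP_STEPS):
    """Linear warmup to peak_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Samples per optimizer update, assuming a single device.
effective_batch = TRAIN_BATCH_SIZE * GRAD_ACCUM_STEPS
print(effective_batch)  # 64

# Assumption for illustration only: the planned total is not given in the card.
HYPOTHETICAL_TOTAL_STEPS = 50_000
print(linear_lr(1_000, HYPOTHETICAL_TOTAL_STEPS))   # halfway through warmup
print(linear_lr(2_000, HYPOTHETICAL_TOTAL_STEPS))   # peak learning rate
print(linear_lr(30_000, HYPOTHETICAL_TOTAL_STEPS))  # where training was stopped
```

Because training was stopped early, the learning rate at the last step was still above zero under any planned total longer than 30k steps.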

Framework Versions

Transformers 4.16.0.dev0
Pytorch 1.10.1+cu102
Datasets 1.17.1.dev0
Tokenizers 0.11.0

Additional Notes

The training and evaluation scripts were modified from https://huggingface.co/chmanoj/xls-r-300m-te and https://github.com/huggingface/transformers/tree/master/examples/research_projects/robust-speech-event. Bengali speech data was not available from the Common Voice or multilingual LibriSpeech datasets, so OpenSLR53 was used.
A minimum audio duration of 0.1s was used to filter the training data, which may have excluded 10-20 samples.
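The duration filter amounts to comparing each clip's length against a threshold. The sketch below is a hypothetical illustration, not the script's code: it assumes 16 kHz audio (the rate wav2vec2 XLS-R checkpoints expect), and the helper names and clip lengths are made up. Whether a clip of exactly 0.1s is kept in the original script is not stated, so the example avoids that boundary.

```python
SAMPLING_RATE = 16_000  # assumption: audio resampled to 16 kHz
MIN_DURATION_S = 0.1


def duration_seconds(num_samples, sampling_rate=SAMPLING_RATE):
    """Clip duration in seconds from its length in samples."""
    return num_samples / sampling_rate


def keep_clip(num_samples, min_duration=MIN_DURATION_S):
    """True if the clip is longer than the minimum duration."""
    return duration_seconds(num_samples) > min_duration


clip_lengths = [800, 3_200, 48_000]  # made-up lengths: 0.05s, 0.2s, 3s
print([n for n in clip_lengths if keep_clip(n)])  # [3200, 48000]
```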

📄 License

This model is released under the Apache-2.0 license.

📦 Model Information

Property	Details
Model Type	Fine-tuned wav2vec2 model for Bengali automatic speech recognition
Training Data	OPENSLR_SLR53 Bengali dataset
Metrics	WER, CER
Datasets	openslr, SLR53, Harveenchadha/indic-text

📚 Model Index

Task	Dataset	Metrics
Automatic Speech Recognition (Speech Recognition)	Open SLR (SLR66)	Test WER: 0.31104373941386626; Test CER: 0.07263099973420006; Test WER with LM: 0.17776164652632478; Test CER with LM: 0.04394092712884769

📖 Citation

@misc {tahsin_mayeesha_2023,
    author       = { {Tahsin Mayeesha} },
    title        = { wav2vec2-bn-300m (Revision e10defc) },
    year         = 2023,
    url          = { https://huggingface.co/Tahsin-Mayeesha/wav2vec2-bn-300m },
    doi          = { 10.57967/hf/0939 },
    publisher    = { Hugging Face }
}
