BanglaASR: An Open-Source Bangla Automatic Speech Recognition Model - Free Deployment, Precise Transcription with Low WER

BanglaASR

Developed by bangla-speech-processing

This is a Bengali automatic speech recognition model based on the Whisper small architecture, fine-tuned on approximately 400 hours of the Mozilla Common Voice dataset. It reports a word error rate (WER) of 4.58%.

Speech Recognition

Transformers

Open Source License: MIT #Bengali speech recognition #Low word error rate (4.58%) #Whisper fine-tuning

Downloads 782

Release Time: 6/22/2023

Model Overview

This model is designed specifically for Bengali speech recognition and is fine-tuned from the Transformer-based Whisper model.

Model Features

High Accuracy Recognition

Achieves a word error rate of 4.58% on Bengali speech recognition tasks

Specialized Optimization

Whisper model specifically optimized for Bengali

Medium Scale

Uses the small variant with 244M parameters, balancing performance and resource requirements

Model Capabilities

Bengali speech-to-text

Long audio processing

Real-time speech recognition

Use Cases

Speech Transcription

Voice Recording Transcription

Automatically convert Bengali voice recordings to text

Reported WER of 4.58% (roughly 95.42% word-level accuracy, taken as 100% minus the WER)

Voice Assistant

Provide recognition capabilities for Bengali voice assistants

Education

Language Learning Assistance

Help learners practice Bengali pronunciation and listening

🚀 Bangla ASR Model

A Bangla Automatic Speech Recognition (ASR) model fine-tuned on the Bangla Mozilla Common Voice Dataset.

🚀 Quick Start

The Bangla ASR model is fine-tuned from the Whisper model using the Bangla Mozilla Common Voice Dataset. It was trained on approximately 400 hours of data, with 40k samples for training and 7k for validation. After 12,000 training steps, it achieved a Word Error Rate (WER) of 4.58%.

💻 Usage Examples

Basic Usage

import librosa
import numpy as np
import torch
import torchaudio

from transformers import WhisperFeatureExtractor
from transformers import WhisperForConditionalGeneration
from transformers import WhisperProcessor

# Run on the GPU when one is available, otherwise fall back to the CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Sample clip hosted in the model repository.
mp3_path = "https://huggingface.co/bangla-speech-processing/BanglaASR/resolve/main/mp3/common_voice_bn_31515636.mp3"

model_path = "bangla-speech-processing/BanglaASR"

# Load the feature extractor, the processor (used for decoding), and the model.
feature_extractor = WhisperFeatureExtractor.from_pretrained(model_path)
processor = WhisperProcessor.from_pretrained(model_path)
model = WhisperForConditionalGeneration.from_pretrained(model_path).to(device)

# Load the audio, keep the first channel, and resample to the 16 kHz rate Whisper expects.
speech_array, sampling_rate = torchaudio.load(mp3_path, format="mp3")
speech_array = speech_array[0].numpy()
speech_array = librosa.resample(np.asarray(speech_array), orig_sr=sampling_rate, target_sr=16000)

# Convert the waveform into log-Mel input features.
input_features = feature_extractor(speech_array, sampling_rate=16000, return_tensors="pt").input_features

# Generate token IDs for the single clip in the batch.
predicted_ids = model.generate(inputs=input_features.to(device))[0]

# Decode the token IDs into text, dropping special tokens.
transcription = processor.decode(predicted_ids, skip_special_tokens=True)

print(transcription)
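Whisper's feature extractor works on 30-second windows, so the Basic Usage example as written handles one short clip at a time. For longer recordings, one simple approach is to split the 16 kHz waveform into consecutive 30-second chunks and transcribe each in turn. The sketch below shows only the splitting step; the helper name `split_into_chunks` and its defaults are illustrative assumptions, not part of the BanglaASR repository.

```python
SAMPLE_RATE = 16_000   # Hz, the rate used in the Basic Usage example
CHUNK_SECONDS = 30     # length of the window Whisper processes at once


def split_into_chunks(samples, sample_rate=SAMPLE_RATE, chunk_seconds=CHUNK_SECONDS):
    """Split a 1-D sequence of audio samples into consecutive fixed-length chunks.

    Hypothetical helper: works on any sliceable sequence (list, NumPy array).
    The last chunk may be shorter than the others.
    """
    chunk_len = sample_rate * chunk_seconds
    if chunk_len <= 0:
        raise ValueError("sample_rate and chunk_seconds must be positive")
    return [samples[start:start + chunk_len]
            for start in range(0, len(samples), chunk_len)]


# 65 seconds of silence -> two full 30 s chunks plus a 5 s remainder.
chunks = split_into_chunks([0.0] * (SAMPLE_RATE * 65))
print([len(c) / SAMPLE_RATE for c in chunks])
```

Each chunk can then be passed through `feature_extractor` and `model.generate` as in the example above, and the resulting transcriptions joined. Fixed boundaries can cut a word in half, so overlapping chunks or silence-based splitting may give better results in practice.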

📚 Documentation

Dataset

The model uses the Mozilla Common Voice Dataset: around 400 hours of MP3 audio, split into 40k samples for training and 7k samples for validation. For more information about the dataset, see the Mozilla Common Voice dataset page.

Training Model Information

Property	Details
Model Type	Whisper small variant (244 M parameters)
Training Data	Mozilla Common Voice dataset, around 400 hours of data (40k samples for training and 7k for validation)

Size	Layers	Width	Heads	Parameters	Bangla-only	Training Status
tiny	4	384	6	39 M	X	X
base	6	512	8	74 M	X	X
small	12	768	12	244 M	✓	✓
medium	24	1024	16	769 M	X	X
large	32	1280	20	1550 M	X	X
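To put the parameter counts in the table into resource terms, the sketch below estimates how much memory the weights alone would occupy at a given precision. It assumes 4 bytes per parameter for 32-bit floats and 2 bytes for 16-bit floats, and it ignores activations, the decoding cache, and framework overhead, so real usage will be higher. The function name is an illustrative assumption.

```python
# Parameter counts in millions, copied from the table above.
WHISPER_PARAMS_MILLIONS = {
    "tiny": 39,
    "base": 74,
    "small": 244,
    "medium": 769,
    "large": 1550,
}


def weight_memory_mb(size, bytes_per_param=4):
    """Rough weight-only memory estimate in megabytes (1 MB = 10**6 bytes)."""
    # millions of parameters * bytes per parameter = millions of bytes = MB
    return float(WHISPER_PARAMS_MILLIONS[size] * bytes_per_param)


for size in WHISPER_PARAMS_MILLIONS:
    fp32 = weight_memory_mb(size, 4)
    fp16 = weight_memory_mb(size, 2)
    print(f"{size:>6}: ~{fp32:.0f} MB (fp32), ~{fp16:.0f} MB (fp16)")
```

Under these assumptions the small variant used here needs a little under 1 GB for its weights in 32-bit precision.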

Evaluation

The model's Word Error Rate (WER) is 4.58%. For more information, see the project's GitHub repository.
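For readers unfamiliar with the metric, WER is the number of word-level substitutions, deletions, and insertions needed to turn the model's output into the reference transcript, divided by the number of words in the reference. The following is a minimal sketch of that standard definition, assuming simple whitespace tokenization; it is not necessarily the evaluation script the authors used, and the function name is illustrative.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
print(word_error_rate("a b c d", "a x c d"))
```

Because insertions count as errors, WER can exceed 100%, which is why "100% minus WER" is only an approximate accuracy figure.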

📄 License

This project is licensed under the MIT license.

📖 Citation

@misc{BanglaASR,
  title={Transformer Based Whisper Bangla ASR Model},
  author={Md Saiful Islam},
  howpublished={},
  year={2023}
}
