Bangla ASR Model
A Bangla Automatic Speech Recognition (ASR) model fine-tuned on the Bangla Mozilla Common Voice dataset.
Quick Start
The Bangla ASR model is fine-tuned from the Whisper model on the Bangla Mozilla Common Voice dataset: approximately 400 hours of audio, split into 40k training samples and 7k validation samples. After 12,000 training steps, it reached a Word Error Rate (WER) of 4.58%.
Usage Examples
Basic Usage
import librosa
import numpy as np
import torch
import torchaudio
from transformers import (
    WhisperFeatureExtractor,
    WhisperForConditionalGeneration,
    WhisperProcessor,
)

# Run on the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

mp3_path = "https://huggingface.co/bangla-speech-processing/BanglaASR/resolve/main/mp3/common_voice_bn_31515636.mp3"
model_path = "bangla-speech-processing/BanglaASR"

# Load the feature extractor, the processor (used here for decoding), and the model.
feature_extractor = WhisperFeatureExtractor.from_pretrained(model_path)
processor = WhisperProcessor.from_pretrained(model_path)
model = WhisperForConditionalGeneration.from_pretrained(model_path).to(device)

# Load the audio and keep the first channel as a 1-D NumPy array.
# Whether torchaudio.load accepts a URL depends on the installed torchaudio
# version and backend; if it fails, download the file and pass a local path.
speech_array, sampling_rate = torchaudio.load(mp3_path, format="mp3")
speech_array = speech_array[0].numpy()

# Resample to 16 kHz, the rate passed to the feature extractor below.
speech_array = librosa.resample(np.asarray(speech_array), orig_sr=sampling_rate, target_sr=16000)
input_features = feature_extractor(speech_array, sampling_rate=16000, return_tensors="pt").input_features

# Generate token IDs for the single input, then decode them to text.
predicted_ids = model.generate(inputs=input_features.to(device))[0]
transcription = processor.decode(predicted_ids, skip_special_tokens=True)
print(transcription)
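The example above transcribes a single short clip. The standard Whisper feature extractor pads or truncates each input to a 30-second window, so a longer recording would need to be split before transcription. The following is a minimal sketch of one way to compute the split points; the function name `chunk_bounds` and its parameters are hypothetical and not part of this project.

```python
def chunk_bounds(num_samples, sampling_rate=16000, chunk_seconds=30.0):
    """Return (start, end) sample indices that cover the audio in fixed windows.

    Hypothetical helper: the 30-second default mirrors Whisper's input window.
    """
    if num_samples < 0 or sampling_rate <= 0 or chunk_seconds <= 0:
        raise ValueError("num_samples must be >= 0; rate and length must be > 0")
    chunk_len = int(sampling_rate * chunk_seconds)
    # Step through the audio one window at a time; the last window may be shorter.
    return [
        (start, min(start + chunk_len, num_samples))
        for start in range(0, num_samples, chunk_len)
    ]


if __name__ == "__main__":
    # A 70-second recording at 16 kHz yields two full windows and one partial one.
    for start, end in chunk_bounds(70 * 16000):
        print(start, end)
```

Each slice `speech_array[start:end]` could then be passed through the feature extractor and `model.generate` as in the example above, and the resulting transcriptions joined. Fixed boundaries can cut a word in half, so overlapping windows or splitting on silence may give better results.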
Documentation
Dataset
The model uses the Mozilla Common Voice dataset: around 400 hours of MP3 audio covering both the training set (40k samples) and the validation set (7k samples). For more information, see the Mozilla Common Voice dataset page.
Training Model Information
| Property | Details |
| --- | --- |
| Model Type | Whisper small variant (244 M parameters) |
| Training Data | Mozilla Common Voice dataset, around 400 hours of data (40k samples for training, 7k for validation) |

| Size | Layers | Width | Heads | Parameters | Bangla-only | Training Status |
| --- | --- | --- | --- | --- | --- | --- |
| tiny | 4 | 384 | 6 | 39 M | X | X |
| base | 6 | 512 | 8 | 74 M | X | X |
| small | 12 | 768 | 12 | 244 M | ✓ | ✓ |
| medium | 24 | 1024 | 16 | 769 M | X | X |
| large | 32 | 1280 | 20 | 1550 M | X | X |
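
As a small worked example of how the Width and Heads columns relate: in standard multi-head attention the model width is divided evenly across the heads. The sketch below copies the rows from the size table and computes the per-head dimension under that assumption; the names `WHISPER_SIZES` and `head_dim` are illustrative only.

```python
# Rows copied from the size table: (size, layers, width, heads).
WHISPER_SIZES = [
    ("tiny", 4, 384, 6),
    ("base", 6, 512, 8),
    ("small", 12, 768, 12),
    ("medium", 24, 1024, 16),
    ("large", 32, 1280, 20),
]


def head_dim(width, heads):
    """Per-head dimension, assuming the width is split evenly across heads."""
    if heads <= 0 or width % heads != 0:
        raise ValueError("width must be a positive multiple of heads")
    return width // heads


if __name__ == "__main__":
    for size, layers, width, heads in WHISPER_SIZES:
        print(f"{size}: {layers} layers, {head_dim(width, heads)} dims per head")
```

Under this assumption every size in the table works out to 64 dimensions per head, so the larger variants grow by adding layers and heads while the size of each head stays the same.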
Evaluation
The model achieves a Word Error Rate (WER) of 4.58%. For more information, see the project's GitHub repository.
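For readers unfamiliar with the metric, WER is conventionally defined as the word-level edit distance (substitutions + deletions + insertions) between the reference and the model output, divided by the number of reference words. The following is a minimal sketch of that standard definition, assuming plain whitespace tokenization. It is not necessarily the evaluation script used for this model, whose text normalization is not described here, and the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into zero words needs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against a four-word reference.
    print(word_error_rate("a b c d", "a x c"))
```

Because insertions count as errors, WER can exceed 100% when the output is much longer than the reference.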
License
This project is licensed under the MIT license.
Citation
@misc{BanglaASR,
title={Transformer Based Whisper Bangla ASR Model},
author={Md Saiful Islam},
howpublished={},
year={2023}
}