Tahsin-Mayeesha/wav2vec2-bn-300m
This model is a fine-tuned version of facebook/wav2vec2-xls-r-300m for Bengali automatic speech recognition.
Quick Start
The model was fine-tuned from facebook/wav2vec2-xls-r-300m on the OPENSLR_SLR53 - bengali dataset. Its scores on the evaluation set are listed under Evaluation Results.
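As a starting point, the model can probably be loaded through the generic `transformers` speech recognition pipeline. The following is a minimal sketch, not the author's own inference script: the helper name `transcribe` and the file name in the usage comment are hypothetical. Whether the 5-gram language model is used for decoding depends on the files in the repository and on optional decoder packages being installed, which is not verified here.

```python
def transcribe(audio_path, model_id="Tahsin-Mayeesha/wav2vec2-bn-300m"):
    """Transcribe one audio file with the Hugging Face ASR pipeline.

    `transcribe` is a hypothetical helper name used for illustration.
    """
    # Imported lazily so that defining the helper does not require transformers.
    from transformers import pipeline

    # Builds a speech recognition pipeline around the fine-tuned checkpoint.
    asr = pipeline("automatic-speech-recognition", model=model_id)

    # When given a file path, the pipeline decodes the audio (ffmpeg is needed)
    # and resamples it to the rate the feature extractor expects.
    result = asr(audio_path)
    return result["text"]


# Example call (hypothetical file name):
# print(transcribe("sample_bengali.wav"))
```

The first call downloads the checkpoint, so it needs network access and enough disk space for a 300M-parameter model.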
⨠Features
Evaluation Results
- Without language model:
  - WER: 0.31104
  - CER: 0.07263
- With 5-gram language model trained on the indic-text dataset:
  - WER: 0.17776
  - CER: 0.04394
Note
- 10% of the 218703 total samples were held out for evaluation, giving an evaluation set of 21871 examples. Training was stopped after 30k steps. Output predictions are available under the files section.
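For readers unfamiliar with the two metrics, WER and CER are both edit distances normalized by reference length, computed over words and characters respectively. The sketch below is a plain-Python illustration of those definitions, not the evaluation script used for this model; the function names are chosen for illustration, and the CER here counts spaces and individual Unicode code points as characters, which may differ from the original setup.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def wer(references, hypotheses):
    """Word error rate: total word-level edits / total reference words."""
    errors = sum(edit_distance(r.split(), h.split())
                 for r, h in zip(references, hypotheses))
    words = sum(len(r.split()) for r in references)
    return errors / words


def cer(references, hypotheses):
    """Character error rate: total character-level edits / total reference characters."""
    errors = sum(edit_distance(list(r), list(h))
                 for r, h in zip(references, hypotheses))
    chars = sum(len(r) for r in references)
    return errors / chars
```

Both functions pool errors over the whole corpus before dividing, so long utterances weigh more than short ones.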
Documentation
Training Hyperparameters
The following hyperparameters were used during training:
- learning_rate: 7.5e-05
- train_batch_size: 16
- eval_batch_size: 16
- gradient_accumulation_steps: 4
- optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 2000
- mixed_precision_training: Native AMP
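Two quantities follow from the hyperparameters above: the effective batch size per optimizer step and the learning rate at a given step under a linear schedule with warmup. The sketch below works them out under two assumptions that the card does not state: training ran on a single device, and the schedule's total length was 30,000 steps (the point at which training was stopped; the planned horizon may have been longer).

```python
LEARNING_RATE = 7.5e-05
TRAIN_BATCH_SIZE = 16
GRADIENT_ACCUMULATION_STEPS = 4
WARMUP_STEPS = 2000
TOTAL_STEPS = 30_000  # assumption: schedule length equals the 30k steps actually run


def effective_batch_size(per_device, accumulation, num_devices=1):
    """Examples contributing to one optimizer step (num_devices=1 is an assumption)."""
    return per_device * accumulation * num_devices


def linear_schedule_lr(step, peak_lr=LEARNING_RATE,
                       warmup_steps=WARMUP_STEPS, total_steps=TOTAL_STEPS):
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Decay: fall linearly from the peak to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


print(effective_batch_size(TRAIN_BATCH_SIZE, GRADIENT_ACCUMULATION_STEPS))
print(linear_schedule_lr(1000), linear_schedule_lr(2000), linear_schedule_lr(16000))
```

With these assumptions each optimizer step sees 16 × 4 = 64 examples, and the learning rate peaks at step 2000 before decaying.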
Framework Versions
- Transformers 4.16.0.dev0
- Pytorch 1.10.1+cu102
- Datasets 1.17.1.dev0
- Tokenizers 0.11.0
Additional Notes
- Training and evaluation script modified from https://huggingface.co/chmanoj/xls-r-300m-te and https://github.com/huggingface/transformers/tree/master/examples/research_projects/robust-speech-event. Bengali speech data was not available from the Common Voice or multilingual LibriSpeech datasets, so OpenSLR53 has been used.
- A minimum audio duration of 0.1s has been used to filter the training data, which may have excluded 10-20 samples.
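The minimum-duration filter mentioned above can be illustrated as follows. This is a sketch under stated assumptions rather than the training script's actual code: the 16 kHz sampling rate is the usual wav2vec2 input rate, and the `"audio"` key, the example layout, and the function names are hypothetical.

```python
MIN_DURATION_SECONDS = 0.1
SAMPLING_RATE = 16_000  # assumption: audio resampled to 16 kHz for wav2vec2


def duration_seconds(num_samples, sampling_rate=SAMPLING_RATE):
    """Clip length in seconds given its number of audio samples."""
    return num_samples / sampling_rate


def filter_short_clips(examples, min_duration=MIN_DURATION_SECONDS,
                       sampling_rate=SAMPLING_RATE):
    """Keep only examples whose audio lasts at least `min_duration` seconds.

    Each example is assumed to be a dict with an "audio" entry holding the
    raw sample values (a hypothetical layout chosen for this sketch).
    """
    return [
        ex for ex in examples
        if duration_seconds(len(ex["audio"]), sampling_rate) >= min_duration
    ]


# Toy data: 0.05 s, 0.125 s and 1 s of silence at 16 kHz.
clips = [{"audio": [0.0] * 800}, {"audio": [0.0] * 2000}, {"audio": [0.0] * 16000}]
kept = filter_short_clips(clips)
print(len(kept))
```

At 16 kHz the 0.1 s threshold corresponds to 1600 samples, so only the 800-sample clip in the toy data is dropped.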
License
This model is released under the Apache-2.0 license.
Model Information
| Property | Details |
| --- | --- |
| Model Type | Fine-tuned wav2vec2 model for Bengali automatic speech recognition |
| Training Data | OPENSLR_SLR53 - Bengali dataset |
| Metrics | WER, CER |
| Datasets | openslr, SLR53, Harveenchadha/indic-text |
Model Index
| Task | Dataset | Metrics |
| --- | --- | --- |
| Automatic Speech Recognition (Speech Recognition) | Open SLR (SLR66) | Test WER: 0.31104373941386626; Test CER: 0.07263099973420006; Test WER with lm: 0.17776164652632478; Test CER with lm: 0.04394092712884769 |
Citation
@misc {tahsin_mayeesha_2023,
author = { {Tahsin Mayeesha} },
title = { wav2vec2-bn-300m (Revision e10defc) },
year = 2023,
url = { https://huggingface.co/Tahsin-Mayeesha/wav2vec2-bn-300m },
doi = { 10.57967/hf/0939 },
publisher = { Hugging Face }
}