# arijitx/wav2vec2-xls-r-300m-bengali
This model is a fine-tuned version of facebook/wav2vec2-xls-r-300m for Bengali automatic speech recognition, achieving strong results on the OpenSLR SLR53 dataset.
## Quick Start
This model is a fine-tuned version of facebook/wav2vec2-xls-r-300m on the OpenSLR SLR53 Bengali dataset. It achieves the following results on the evaluation set:
**Without language model**
- WER: 0.21726385291857586
- CER: 0.04725010353701041
**With 5-gram language model**
The 5-gram language model was trained on 30M sentences randomly sampled from the AI4Bharat IndicCorp dataset.
- WER: 0.15322879016421437
- CER: 0.03413696666806267
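For reference, here is a minimal inference sketch; it is not part of the original card. It assumes transformers and torchaudio are installed and uses a placeholder audio path (`sample_bn.wav`), with 16 kHz mono input expected by the model:

```python
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("arijitx/wav2vec2-xls-r-300m-bengali")
model = Wav2Vec2ForCTC.from_pretrained("arijitx/wav2vec2-xls-r-300m-bengali")

speech, sr = torchaudio.load("sample_bn.wav")  # placeholder clip
if sr != 16_000:                               # the model expects 16 kHz audio
    speech = torchaudio.functional.resample(speech, sr, 16_000)

inputs = processor(speech.squeeze().numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids)[0])     # greedy (no-LM) transcription
```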
## ⚠️ Important Note
The dataset contains 10,935 samples in total; training used the first 95% and evaluation the last 5%, so no evaluation example was part of the training set. Training was stopped after 180k steps. Output predictions are available in the Files section.
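A minimal sketch of that split (assumed code, not from the card), using the Hub's openslr dataset with the SLR53 config:

```python
from datasets import load_dataset

ds = load_dataset("openslr", "SLR53", split="train")  # 10,935 examples in total
n_train = int(0.95 * len(ds))                         # first 95% -> training
train_ds = ds.select(range(n_train))
eval_ds = ds.select(range(n_train, len(ds)))          # last 5% (~547 examples) -> evaluation
```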
## Documentation
### Model Information

| Property | Details |
|----------|---------|
| Model Type | Automatic Speech Recognition |
| Supported Languages | Bengali (bn) |
| Tags | automatic-speech-recognition, bn, hf-asr-leaderboard, openslr_SLR53, robust-speech-event |
| Datasets | openslr, SLR53, AI4Bharat/IndicCorp |
| Metrics | WER, CER |
Model Performance
Task |
Dataset |
WER (Without LM) |
CER (Without LM) |
WER (With LM) |
CER (With LM) |
Speech Recognition |
Open SLR (SLR53) |
0.21726385291857586 |
0.04725010353701041 |
0.15322879016421437 |
0.03413696666806267 |
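For the with-LM numbers, a hedged decoding sketch with pyctcdecode follows. This is not the card's evaluation code, and the kenlm binary path `bn_5gram.bin` is a placeholder: the card does not state how the 30M-sentence IndicCorp LM is distributed.

```python
import torch
import torchaudio
from pyctcdecode import build_ctcdecoder
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("arijitx/wav2vec2-xls-r-300m-bengali")
model = Wav2Vec2ForCTC.from_pretrained("arijitx/wav2vec2-xls-r-300m-bengali")

# Decoder labels in vocab-id order; "|" is wav2vec2's word delimiter, which
# pyctcdecode expects as a space. Pad-style tokens are treated as the CTC blank.
vocab = sorted(processor.tokenizer.get_vocab().items(), key=lambda kv: kv[1])
labels = [tok if tok != "|" else " " for tok, _ in vocab]
decoder = build_ctcdecoder(labels, kenlm_model_path="bn_5gram.bin")  # placeholder LM path

speech, sr = torchaudio.load("sample_bn.wav")  # placeholder clip, 16 kHz mono assumed
inputs = processor(speech.squeeze().numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits
print(decoder.decode(logits[0].numpy()))       # LM-boosted transcription
```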
### Training Hyperparameters
The following hyperparameters were used during training:
- dataset_name="openslr"
- model_name_or_path="facebook/wav2vec2-xls-r-300m"
- dataset_config_name="SLR53"
- output_dir="./wav2vec2-xls-r-300m-bengali"
- overwrite_output_dir
- num_train_epochs="50"
- per_device_train_batch_size="32"
- per_device_eval_batch_size="32"
- gradient_accumulation_steps="1"
- learning_rate="7.5e-5"
- warmup_steps="2000"
- length_column_name="input_length"
- evaluation_strategy="steps"
- text_column_name="sentence"
- chars_to_ignore , ? . ! - ; : " “ % ‘ ” � — ’ … –
- save_steps="2000"
- eval_steps="3000"
- logging_steps="100"
- layerdrop="0.0"
- activation_dropout="0.1"
- save_total_limit="3"
- freeze_feature_encoder
- feat_proj_dropout="0.0"
- mask_time_prob="0.75"
- mask_time_length="10"
- mask_feature_prob="0.25"
- mask_feature_length="64"
- preprocessing_num_workers="32"
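As a rough guide (not code from the card), the model-side settings above correspond to these transformers config overrides; the remaining flags are arguments to the training script referenced in the notes below:

```python
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-xls-r-300m",
    layerdrop=0.0,
    activation_dropout=0.1,
    feat_proj_dropout=0.0,
    mask_time_prob=0.75,     # SpecAugment-style masking along the time axis
    mask_time_length=10,
    mask_feature_prob=0.25,  # masking along the feature axis
    mask_feature_length=64,
)
model.freeze_feature_encoder()  # mirrors the freeze_feature_encoder flag
```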
### Framework Versions
- Transformers 4.16.0.dev0
- Pytorch 1.10.1+cu102
- Datasets 1.17.1.dev0
- Tokenizers 0.11.0
### Notes
- Training and evaluation code was adapted from: https://github.com/huggingface/transformers/tree/master/examples/research_projects/robust-speech-event.
- Bengali speech data was not available in the Common Voice or multilingual LibriSpeech datasets, so OpenSLR SLR53 was used.
- A minimum audio duration of 0.5 s was used to filter the training data, which excluded roughly 10-20 samples (see the sketch after this list).
- OpenSLR SLR53 transcripts are not part of the training data for the LM used in evaluation.
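The duration filter mentioned above might look like the following sketch; the column names follow the Hub's openslr schema, and the exact filtering code is not in the card:

```python
from datasets import load_dataset

MIN_SECONDS = 0.5  # clips shorter than this were dropped from training

ds = load_dataset("openslr", "SLR53", split="train")

def long_enough(example):
    audio = example["audio"]  # decoded to {"array", "sampling_rate", ...} by datasets
    return len(audio["array"]) / audio["sampling_rate"] >= MIN_SECONDS

ds = ds.filter(long_enough)
```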
## License
This model is licensed under the Apache 2.0 license.