🚀 Bangla BERT Base
Bangla BERT Base is a pretrained language model for Bengali. It uses masked language modeling as described in BERT and is available on the Hugging Face model hub.
🚀 Quick Start
It has been a long journey, but Bangla-Bert is finally here! It is now available on the Hugging Face model hub.
Bangla-Bert-Base is a pretrained language model for Bengali, trained with masked language modeling as described in the BERT paper and its GitHub repository.
✨ Features
- Pretrained on Diverse Data: trained on data from Common Crawl, Wikipedia, and OSCAR.
- State-of-the-art Results: achieves state-of-the-art results on Bengali classification benchmark datasets and the Wikiann dataset.
📦 Installation
The model is loaded through the Hugging Face Transformers library, so installing it with `pip install transformers` (along with PyTorch) is enough to run the examples below.
💻 Usage Examples
Basic Usage
Bangla BERT Tokenizer
```python
from transformers import AutoTokenizer, AutoModel

bnbert_tokenizer = AutoTokenizer.from_pretrained("sagorsarker/bangla-bert-base")
text = "আমি বাংলায় গান গাই।"
bnbert_tokenizer.tokenize(text)
```
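The snippet above imports AutoModel but only exercises the tokenizer. As a minimal sketch (not part of the original card), the encoder can also be used to extract contextual embeddings, assuming PyTorch is installed:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Sketch: extract contextual embeddings with the pretrained encoder.
tokenizer = AutoTokenizer.from_pretrained("sagorsarker/bangla-bert-base")
model = AutoModel.from_pretrained("sagorsarker/bangla-bert-base")

inputs = tokenizer("আমি বাংলায় গান গাই।", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch_size, sequence_length, 768)
print(outputs.last_hidden_state.shape)
```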
Advanced Usage
MASK Generation
```python
from transformers import BertForMaskedLM, BertTokenizer, pipeline

model = BertForMaskedLM.from_pretrained("sagorsarker/bangla-bert-base")
tokenizer = BertTokenizer.from_pretrained("sagorsarker/bangla-bert-base")
nlp = pipeline("fill-mask", model=model, tokenizer=tokenizer)
for pred in nlp(f"আমি বাংলায় {nlp.tokenizer.mask_token} গাই।"):
    print(pred)
```
📚 Documentation
Pretrain Corpus Details
The corpus was downloaded from two main sources:
- Bengali Common Crawl corpus (distributed through OSCAR)
- Bengali Wikipedia dump
After downloading these corpora, we preprocessed them into the BERT format: one sentence per line, with an extra newline between documents. For example:
```
sentence 1
sentence 2

sentence 1
sentence 2
```
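As a minimal sketch (not the original preprocessing script), writing a corpus in this format from already sentence-split documents could look like the following; the `documents` variable and output file name are placeholders:

```python
# Sketch: write a corpus in BERT pretraining format,
# one sentence per line, a blank line between documents.
# `documents` is a hypothetical list of lists of sentences.
documents = [
    ["প্রথম নথির প্রথম বাক্য।", "প্রথম নথির দ্বিতীয় বাক্য।"],
    ["দ্বিতীয় নথির প্রথম বাক্য।"],
]

with open("pretrain_corpus.txt", "w", encoding="utf-8") as f:
    for doc in documents:
        for sentence in doc:
            f.write(sentence + "\n")
        f.write("\n")  # blank line separates documents
```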
Building Vocab
We used the BNLP package to train a Bengali SentencePiece model with a vocab size of 102,025, and then preprocessed the output vocab file into the BERT format.
Our final vocab file is available at https://github.com/sagorbrur/bangla-bert and on the Hugging Face model hub.
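BNLP wraps Google's sentencepiece library under the hood. The following is a minimal sketch (not the exact command used for this model) of training a SentencePiece model with a 102,025-token vocabulary directly with sentencepiece; the file names are placeholders:

```python
import sentencepiece as spm

# Sketch: train a Bengali sentencepiece model on the pretraining corpus.
# "pretrain_corpus.txt" and "bn_spm" are placeholder names.
spm.SentencePieceTrainer.train(
    input="pretrain_corpus.txt",
    model_prefix="bn_spm",
    vocab_size=102025,
    character_coverage=1.0,  # keep all Bengali characters
)
# Produces bn_spm.model and bn_spm.vocab; the .vocab file is then
# converted to a BERT-style vocab.txt (one token per line).
```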
Training Details
- Bangla-Bert was trained with the code provided in Google BERT's GitHub repository (https://github.com/google-research/bert).
- The currently released model follows the bert-base-uncased architecture (12 layers, 768 hidden units, 12 attention heads, 110M parameters).
- Total training steps: 1 million.
- The model was trained on a single Google Cloud GPU.
Evaluation Results
LM Evaluation Results
After training for 1 million steps, here are the evaluation results:
```
global_step = 1000000
loss = 2.2406516
masked_lm_accuracy = 0.60641736
masked_lm_loss = 2.201459
next_sentence_accuracy = 0.98625
next_sentence_loss = 0.040997364
perplexity = numpy.exp(2.2406516) = 9.393331287442784
```

Loss for the final step: 2.426227
Downstream Task Evaluation Results
- Evaluation on Bengali Classification Benchmark Datasets
Huge thanks to Nick Doiron for providing the evaluation results for the classification task.
He used the Bengali Classification Benchmark datasets for the classification task.
Compared with Nick's Bengali ELECTRA and multilingual BERT, Bangla BERT Base achieves state-of-the-art results.
Here is the evaluation script.
| Model | Sentiment Analysis | Hate Speech Task | News Topic Task | Average |
|-------|--------------------|------------------|-----------------|---------|
| mBERT | 68.15 | 52.32 | 72.27 | 64.25 |
| Bengali Electra | 69.19 | 44.84 | 82.33 | 65.45 |
| Bangla BERT Base | 70.37 | 71.83 | 89.19 | 77.13 |
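As a rough illustration only (not Nick Doiron's evaluation script, and not the hyperparameters behind the numbers above), fine-tuning this checkpoint for text classification with the Transformers Trainer could look like the sketch below; the toy dataset, label count, and training settings are placeholders:

```python
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

model_name = "sagorsarker/bangla-bert-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Toy placeholder data; swap in the actual benchmark splits.
raw = Dataset.from_dict({
    "text": ["খুব ভালো লাগলো।", "একদম ভালো লাগেনি।"],
    "label": [1, 0],
})
encoded = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True,
                            padding="max_length", max_length=64),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="bangla-bert-clf",  # placeholder output directory
        num_train_epochs=1,
        per_device_train_batch_size=2,
    ),
    train_dataset=encoded,
)
trainer.train()
```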
- Evaluation on Wikiann Datasets
We evaluated Bangla-BERT-Base on the Wikiann Bengali NER dataset along with three other benchmark models (mBERT, XLM-R, and Indic-BERT).
Bangla-BERT-Base placed third, with mBERT first and XLM-R second, after training each model for 5 epochs.
All four models were trained with the transformers token-classification notebook.
You can find all models' evaluation results here.
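As a rough illustration (not the exact notebook used above), loading this checkpoint for token classification on the Bengali Wikiann split could look like the sketch below; the dataset loading and label handling follow the standard Hugging Face datasets API and are assumptions:

```python
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Load the Bengali split of Wikiann (dataset name as hosted on the Hugging Face Hub).
dataset = load_dataset("wikiann", "bn")
label_names = dataset["train"].features["ner_tags"].feature.names  # O, B-PER, I-PER, ...

tokenizer = AutoTokenizer.from_pretrained("sagorsarker/bangla-bert-base")
model = AutoModelForTokenClassification.from_pretrained(
    "sagorsarker/bangla-bert-base", num_labels=len(label_names)
)

# Word-level NER tags still need to be aligned to subword tokens before
# fine-tuning, as in the Hugging Face token-classification notebook.
print(label_names)
```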
You can also check the paper list below; these works used this model on their datasets:
NB: If you use this model for any NLP task, please share the evaluation results with us, and we will add them here.
🔧 Technical Details
The model uses masked language modeling as described in the BERT paper. It is trained on a large corpus of Bengali text from Common Crawl, Wikipedia, and OSCAR. The training follows the bert-base-uncased architecture: 12 layers, 768 hidden units, 12 attention heads, and 110M parameters.
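A quick sanity check of these architecture numbers against the published checkpoint, using the standard Transformers config API (a sketch, not from the original card):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("sagorsarker/bangla-bert-base")
# Expect 12 layers, 768 hidden size, 12 attention heads for a BERT-base model.
print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)
```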
📄 License
This project is licensed under the MIT license.
📖 Author
Sagor Sarker
📚 Reference
- https://github.com/google-research/bert
📝 Citation
If you find this model helpful, please cite:
```bibtex
@misc{Sagor_2020,
  title  = {BanglaBERT: Bengali Mask Language Model for Bengali Language Understanding},
  author = {Sagor Sarker},
  year   = {2020},
  url    = {https://github.com/sagorbrur/bangla-bert}
}
```