# Bangla BERT Base
We have published a pre-trained Bangla BERT language model named bangla-bert, which is available on the Hugging Face model hub. The model was trained with the masked language modeling objective described in BERT and in the accompanying GitHub [repository](https://github.com/google-research/bert).
## Features
- Tags: Bert base Bangla, Bengali Bert, Bengali lm, Bangla Base Bert, Bangla Bert language model, Bangla Bert
- Datasets: BanglaLM dataset
## Installation
The model is hosted on the Hugging Face model hub, so no separate download step is required. Install the Hugging Face transformers library (e.g. `pip install transformers`) and load the model directly, as shown in the usage examples below.
## Usage Examples
### Basic Usage
**bangla-bert Tokenizer**
```python
from transformers import AutoTokenizer

# Load the bangla-bert tokenizer from the Hugging Face model hub
bnbert_tokenizer = AutoTokenizer.from_pretrained("Kowsher/bangla-bert")

# "Purer than pure gold is the soil of my country" (a well-known Bengali song lyric)
text = "খাঁটি সোনার চাইতে খাঁটি আমার দেশের মাটি"
print(bnbert_tokenizer.tokenize(text))
```
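The same checkpoint can also be loaded with `AutoModel` to obtain contextual embeddings. A minimal sketch, assuming a PyTorch backend (the example sentence and variable names here are illustrative, not from the original card):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Kowsher/bangla-bert")
model = AutoModel.from_pretrained("Kowsher/bangla-bert")

# Illustrative input: "Bangla is my language"
inputs = tokenizer("বাংলা আমার ভাষা", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per input token: (batch_size, sequence_length, hidden_size)
print(outputs.last_hidden_state.shape)
```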
### Advanced Usage
**MASK Generation**
Here, we can use the Bangla BERT base model for masked language modeling:
```python
from transformers import BertForMaskedLM, BertTokenizer, pipeline

model = BertForMaskedLM.from_pretrained("Kowsher/bangla-bert")
tokenizer = BertTokenizer.from_pretrained("Kowsher/bangla-bert")

# Build the fill-mask pipeline once and reuse it for each prompt
nlp = pipeline('fill-mask', model=model, tokenizer=tokenizer)

# "I [MASK] the song of Bengal"
for pred in nlp(f"আমি বাংলার গান {nlp.tokenizer.mask_token}"):
    print(pred)

# "You are a razakar, you are a [MASK]"
for pred in nlp(f"তুই রাজাকার তুই {nlp.tokenizer.mask_token}"):
    print(pred)

# "Bangla is my [MASK]"
for pred in nlp(f"বাংলা আমার {nlp.tokenizer.mask_token}"):
    print(pred)
```
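Each prediction is a dictionary containing the candidate token (`token_str`), its probability (`score`), and the completed sentence (`sequence`). In recent versions of transformers you can also pass `top_k` when calling the pipeline to control how many candidates are returned per mask.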
## Documentation
### Corpus Details
We trained the Bangla BERT language model on the BanglaLM dataset, available on Kaggle. The dataset comes in three versions and is approximately 40 GB in total. After downloading it, we trained the model with the masked language modeling objective.
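For readers who want to set up this kind of pretraining themselves, below is a minimal sketch of a masked language modeling pipeline using the transformers Trainer. The corpus file name (`bangla_corpus.txt`), model configuration, and hyperparameters are illustrative assumptions, not the exact settings used to train bangla-bert:

```python
from datasets import load_dataset
from transformers import (BertConfig, BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Hypothetical plain-text corpus file; the actual BanglaLM data is distributed via Kaggle
dataset = load_dataset("text", data_files={"train": "bangla_corpus.txt"})

tokenizer = BertTokenizerFast.from_pretrained("Kowsher/bangla-bert")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# Randomly mask 15% of tokens, as in the original BERT objective
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

# Fresh BERT-base-style model; reuse the tokenizer's vocabulary size
model = BertForMaskedLM(BertConfig(vocab_size=tokenizer.vocab_size))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bangla-bert-mlm",
                           per_device_train_batch_size=32),
    data_collator=collator,
    train_dataset=tokenized,
)
trainer.train()
```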
## License
No license information is provided in the original document.
## Citation
M. Kowsher, A. A. Sami, N. J. Prottasha, M. S. Arefin, P. K. Dhar and T. Koshiba, "Bangla-BERT: Transformer-based Efficient Model for Transfer Learning and Language Understanding," in IEEE Access, 2022, doi: 10.1109/ACCESS.2022.3197662.
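An equivalent BibTeX entry, derived from the citation above (the citation key is arbitrary; volume and page numbers are omitted because they are not given here):

```bibtex
@article{kowsher2022banglabert,
  author  = {M. Kowsher and A. A. Sami and N. J. Prottasha and M. S. Arefin and P. K. Dhar and T. Koshiba},
  title   = {Bangla-BERT: Transformer-based Efficient Model for Transfer Learning and Language Understanding},
  journal = {IEEE Access},
  year    = {2022},
  doi     = {10.1109/ACCESS.2022.3197662}
}
```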
## Author
Kowsher