# Bangla BERT Base
We have published a pre-trained Bangla BERT language model named bangla-bert, which is available on the Hugging Face model hub. The model was trained with the masked language modeling objective described in BERT and in the accompanying GitHub [repository](https://github.com/google-research/bert).
## Features
- Tags: Bert base Bangla, Bengali Bert, Bengali lm, Bangla Base Bert, Bangla Bert language model, Bangla Bert
- Datasets: BanglaLM dataset
## Installation
The model is hosted on the Hugging Face model hub, so no separate download step is required. Install the Hugging Face transformers library (e.g. `pip install transformers`) and load the model directly, as shown in the usage examples below.
## Usage Examples
### Basic Usage
**bangla-bert Tokenizer**
```python
from transformers import AutoTokenizer

# Load the bangla-bert tokenizer from the Hugging Face model hub
bnbert_tokenizer = AutoTokenizer.from_pretrained("Kowsher/bangla-bert")

# "Purer than pure gold is the soil of my country" (a well-known Bengali song lyric)
text = "খাঁটি সোনার চাইতে খাঁটি আমার দেশের মাটি"
print(bnbert_tokenizer.tokenize(text))
```
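The same checkpoint can also be loaded with `AutoModel` to obtain contextual embeddings. A minimal sketch, assuming a PyTorch backend (the example sentence and variable names here are illustrative, not from the original card):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Kowsher/bangla-bert")
model = AutoModel.from_pretrained("Kowsher/bangla-bert")

# Illustrative input: "Bangla is my language"
inputs = tokenizer("বাংলা আমার ভাষা", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per input token: (batch_size, sequence_length, hidden_size)
print(outputs.last_hidden_state.shape)
```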
### Advanced Usage
**MASK Generation**
Here, we can use the Bangla BERT base model for masked language modeling:
```python
from transformers import BertForMaskedLM, BertTokenizer, pipeline

model = BertForMaskedLM.from_pretrained("Kowsher/bangla-bert")
tokenizer = BertTokenizer.from_pretrained("Kowsher/bangla-bert")

# Build the fill-mask pipeline once and reuse it for each prompt
nlp = pipeline('fill-mask', model=model, tokenizer=tokenizer)

# "I [MASK] the song of Bengal"
for pred in nlp(f"আমি বাংলার গান {nlp.tokenizer.mask_token}"):
    print(pred)

# "You are a razakar, you are a [MASK]"
for pred in nlp(f"তুই রাজাকার তুই {nlp.tokenizer.mask_token}"):
    print(pred)

# "Bangla is my [MASK]"
for pred in nlp(f"বাংলা আমার {nlp.tokenizer.mask_token}"):
    print(pred)
```
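Each prediction is a dictionary containing the candidate token (`token_str`), its probability (`score`), and the completed sentence (`sequence`). In recent versions of transformers you can also pass `top_k` when calling the pipeline to control how many candidates are returned per mask.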
## Documentation
### Corpus Details
We trained the Bangla BERT language model on the BanglaLM dataset, available on Kaggle. The dataset comes in three versions and is approximately 40 GB in total. After downloading it, we trained the model with the masked language modeling objective.
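For readers who want to set up this kind of pretraining themselves, below is a minimal sketch of a masked language modeling pipeline using the transformers Trainer. The corpus file name (`bangla_corpus.txt`), model configuration, and hyperparameters are illustrative assumptions, not the exact settings used to train bangla-bert:

```python
from datasets import load_dataset
from transformers import (BertConfig, BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Hypothetical plain-text corpus file; the actual BanglaLM data is distributed via Kaggle
dataset = load_dataset("text", data_files={"train": "bangla_corpus.txt"})

tokenizer = BertTokenizerFast.from_pretrained("Kowsher/bangla-bert")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# Randomly mask 15% of tokens, as in the original BERT objective
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

# Fresh BERT-base-style model; reuse the tokenizer's vocabulary size
model = BertForMaskedLM(BertConfig(vocab_size=tokenizer.vocab_size))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bangla-bert-mlm",
                           per_device_train_batch_size=32),
    data_collator=collator,
    train_dataset=tokenized,
)
trainer.train()
```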
## License
No license information is provided in the original document.
## Citation
M. Kowsher, A. A. Sami, N. J. Prottasha, M. S. Arefin, P. K. Dhar and T. Koshiba, "Bangla-BERT: Transformer-based Efficient Model for Transfer Learning and Language Understanding," in IEEE Access, 2022, doi: 10.1109/ACCESS.2022.3197662.
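An equivalent BibTeX entry, derived from the citation above (the citation key is arbitrary; volume and page numbers are omitted because they are not given here):

```bibtex
@article{kowsher2022banglabert,
  author  = {M. Kowsher and A. A. Sami and N. J. Prottasha and M. S. Arefin and P. K. Dhar and T. Koshiba},
  title   = {Bangla-BERT: Transformer-based Efficient Model for Transfer Learning and Language Understanding},
  journal = {IEEE Access},
  year    = {2022},
  doi     = {10.1109/ACCESS.2022.3197662}
}
```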
## Author
Kowsher