🚀 NEPALI BERT
A masked language model for Nepali, trained on news text scraped from various Nepali news websites. The dataset contains approximately 10 million Nepali sentences, primarily news.
🚀 Quick Start
```python
from pprint import pprint

from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("Shushant/nepaliBERT")
model = AutoModelForMaskedLM.from_pretrained("Shushant/nepaliBERT")

fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

pprint(fill_mask(f"तिमीलाई कस्तो {tokenizer.mask_token}."))
```
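The pipeline returns the highest-scoring candidate tokens for the masked position, each with its score and the completed sentence.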
✨ Features
- This model is a fine-tuned version of BERT base uncased on a dataset of news articles scraped from Nepali news portals, comprising 4.6 GB of textual data.
- It can be used for any NLP task involving Devanagari (Nepali) text.
- In intrinsic evaluation it achieves state-of-the-art performance with a perplexity of 8.56, and in extrinsic evaluation on sentiment analysis of Nepali tweets it outperforms other existing masked language models on the Nepali dataset (see the fine-tuning sketch below).
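The sentiment-analysis result above comes from fine-tuning the model on a labeled Nepali tweet dataset; the exact data and hyperparameters are not part of this README. The sketch below shows one way such a fine-tuning run could look with the Hugging Face `Trainer`, where `nepali_tweets.csv`, the three-class label set, and all hyperparameters are illustrative assumptions rather than the setup used in the paper.

```python
import pandas as pd
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Hypothetical labeled data: a CSV with "text" (Nepali tweet) and "label" (0/1/2) columns.
df = pd.read_csv("nepali_tweets.csv")
dataset = Dataset.from_pandas(df).train_test_split(test_size=0.1)

tokenizer = AutoTokenizer.from_pretrained("Shushant/nepaliBERT")
model = AutoModelForSequenceClassification.from_pretrained(
    "Shushant/nepaliBERT", num_labels=3  # a fresh classification head is initialized
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="nepalibert-sentiment",
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
)
trainer.train()
```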
📦 Installation
The model is used through the Hugging Face `transformers` library with a PyTorch backend, so a standard installation such as `pip install transformers torch` is sufficient.
💻 Usage Examples
Basic Usage
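In addition to the Quick Start snippet above, the fill-mask pipeline can be created directly from the model id and asked for several candidate completions. The sketch below is illustrative; it reuses the tokenizer's own mask token and the standard `top_k` argument of the fill-mask pipeline.

```python
from transformers import pipeline

# Build the fill-mask pipeline directly from the Hub model id.
fill_mask = pipeline("fill-mask", model="Shushant/nepaliBERT")

# Ask for the five most likely completions of the masked position.
masked_sentence = f"तिमीलाई कस्तो {fill_mask.tokenizer.mask_token}."
for prediction in fill_mask(masked_sentence, top_k=5):
    print(prediction["token_str"], round(prediction["score"], 4))
```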
📚 Documentation
Model description
Pretraining was done on the BERT base architecture.
Intended uses & limitations
This transformer model can be used for any NLP task involving Devanagari (Nepali) text. At the time of training, it was the state-of-the-art model developed for Devanagari (Nepali) data.
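For downstream tasks, one common pattern is to use the pretrained checkpoint as a frozen encoder and derive sentence embeddings from its hidden states. The sketch below only illustrates that pattern and is not part of the original card; the example sentences and the mean-pooling strategy are arbitrary choices.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Shushant/nepaliBERT")
encoder = AutoModel.from_pretrained("Shushant/nepaliBERT")  # encoder without the MLM head

sentences = ["नेपाल हिमालयको देश हो।", "काठमाडौं नेपालको राजधानी हो।"]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = encoder(**inputs)

# Mean-pool the token embeddings, ignoring padding positions.
mask = inputs["attention_mask"].unsqueeze(-1)
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # (batch_size, hidden_size)
```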
Training and evaluation data
The training corpus was built from 85,467 news articles scraped from different Nepali news portals. This is a preliminary dataset used for experimentation; the corpus size is about 4.3 GB of textual data. The evaluation data consist of a smaller set of news articles, about 12 MB of textual data.
Training procedure
The masked language model was pretrained with the Trainer API from Hugging Face. Pretraining took about 3 days, 8 hours, and 57 minutes on a Tesla V100 GPU. With 640 Tensor Cores, the Tesla V100 was the world's first GPU to break the 100 teraFLOPS (TFLOPS) barrier of deep-learning performance. The GPU was provided through the Kathmandu University (KU) supercomputer; thanks to the KU administration.
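The original training script is not included in this README. As a rough illustration of what Trainer-based masked-language-model training looks like, the sketch below continues MLM training of the released checkpoint on a plain-text corpus; the file name `nepali_corpus.txt` and all hyperparameters are assumptions, not the settings used for this model.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Hypothetical corpus file: one Nepali sentence or paragraph per line.
dataset = load_dataset("text", data_files={"train": "nepali_corpus.txt"})

tokenizer = AutoTokenizer.from_pretrained("Shushant/nepaliBERT")
model = AutoModelForMaskedLM.from_pretrained("Shushant/nepaliBERT")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# Dynamically mask 15% of tokens in each batch, as in standard BERT pretraining.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="nepalibert-mlm",
    per_device_train_batch_size=32,
    num_train_epochs=1,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```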
Data Description
The model was trained on about 4.6 GB of Nepali text collected from various sources, namely Nepali news sites and the OSCAR Nepali corpus.
Paper and Citation Details
If you are interested in the implementation details of this language model, see the paper cited below.
Plain Text
S. Pudasaini, S. Shakya, A. Tamang, S. Adhikari, S. Thapa and S. Lamichhane, "NepaliBERT: Pre-training of Masked Language Model in Nepali Corpus," 2023 7th International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC), Kirtipur, Nepal, 2023, pp. 325-330, doi: 10.1109/I-SMAC58438.2023.10290690.
BibTeX
```bibtex
@INPROCEEDINGS{10290690,
  author={Pudasaini, Shushanta and Shakya, Subarna and Tamang, Aakash and Adhikari, Sajjan and Thapa, Sunil and Lamichhane, Sagar},
  booktitle={2023 7th International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC)},
  title={NepaliBERT: Pre-training of Masked Language Model in Nepali Corpus},
  year={2023},
  volume={},
  number={},
  pages={325-330},
  doi={10.1109/I-SMAC58438.2023.10290690}
}
```
📄 License
This project is licensed under the MIT license.