🚀 NEPALI BERT
A masked language model for Nepali, trained on news text scraped from various Nepali news websites. The dataset contains approximately 10 million Nepali sentences, primarily news.
🚀 Quick Start
```python
from pprint import pprint

from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("Shushant/nepaliBERT")
model = AutoModelForMaskedLM.from_pretrained("Shushant/nepaliBERT")

fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

pprint(fill_mask(f"तिमीलाई कस्तो {tokenizer.mask_token}."))
```
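The pipeline returns the highest-scoring candidate tokens for the masked position, each with its score and the completed sentence.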
✨ Features
- This model is a fine-tuned version of BERT base uncased on a dataset of news articles scraped from Nepali news portals, comprising 4.6 GB of textual data.
- It can be used for any NLP task involving Devanagari (Nepali) text.
- In intrinsic evaluation it achieves state-of-the-art performance with a perplexity of 8.56, and in extrinsic evaluation on sentiment analysis of Nepali tweets it outperforms other existing masked language models on the Nepali dataset (see the fine-tuning sketch below).
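The sentiment-analysis result above comes from fine-tuning the model on a labeled Nepali tweet dataset; the exact data and hyperparameters are not part of this README. The sketch below shows one way such a fine-tuning run could look with the Hugging Face `Trainer`, where `nepali_tweets.csv`, the three-class label set, and all hyperparameters are illustrative assumptions rather than the setup used in the paper.

```python
import pandas as pd
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Hypothetical labeled data: a CSV with "text" (Nepali tweet) and "label" (0/1/2) columns.
df = pd.read_csv("nepali_tweets.csv")
dataset = Dataset.from_pandas(df).train_test_split(test_size=0.1)

tokenizer = AutoTokenizer.from_pretrained("Shushant/nepaliBERT")
model = AutoModelForSequenceClassification.from_pretrained(
    "Shushant/nepaliBERT", num_labels=3  # a fresh classification head is initialized
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="nepalibert-sentiment",
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
)
trainer.train()
```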
📦 Installation
The model is used through the Hugging Face `transformers` library with a PyTorch backend, so a standard installation such as `pip install transformers torch` is sufficient.
💻 Usage Examples
Basic Usage
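In addition to the Quick Start snippet above, the fill-mask pipeline can be created directly from the model id and asked for several candidate completions. The sketch below is illustrative; it reuses the tokenizer's own mask token and the standard `top_k` argument of the fill-mask pipeline.

```python
from transformers import pipeline

# Build the fill-mask pipeline directly from the Hub model id.
fill_mask = pipeline("fill-mask", model="Shushant/nepaliBERT")

# Ask for the five most likely completions of the masked position.
masked_sentence = f"तिमीलाई कस्तो {fill_mask.tokenizer.mask_token}."
for prediction in fill_mask(masked_sentence, top_k=5):
    print(prediction["token_str"], round(prediction["score"], 4))
```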
📚 Documentation
Model description
Pretraining was done on the BERT base architecture.
Intended uses & limitations
This transformer model can be used for any NLP task involving Devanagari (Nepali) text. At the time of training, it was the state-of-the-art model developed for Devanagari (Nepali) data.
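For downstream tasks, one common pattern is to use the pretrained checkpoint as a frozen encoder and derive sentence embeddings from its hidden states. The sketch below only illustrates that pattern and is not part of the original card; the example sentences and the mean-pooling strategy are arbitrary choices.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Shushant/nepaliBERT")
encoder = AutoModel.from_pretrained("Shushant/nepaliBERT")  # encoder without the MLM head

sentences = ["नेपाल हिमालयको देश हो।", "काठमाडौं नेपालको राजधानी हो।"]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = encoder(**inputs)

# Mean-pool the token embeddings, ignoring padding positions.
mask = inputs["attention_mask"].unsqueeze(-1)
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # (batch_size, hidden_size)
```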
Training and evaluation data
The training corpus was built from 85,467 news articles scraped from different Nepali news portals. This is a preliminary dataset used for experimentation; the corpus size is about 4.3 GB of textual data. The evaluation data consist of a smaller set of news articles, about 12 MB of textual data.
Training procedure
The masked language model was pretrained with the Trainer API from Hugging Face. Pretraining took about 3 days, 8 hours, and 57 minutes on a Tesla V100 GPU. With 640 Tensor Cores, the Tesla V100 was the world's first GPU to break the 100 teraFLOPS (TFLOPS) barrier of deep-learning performance. The GPU was provided through the Kathmandu University (KU) supercomputer; thanks to the KU administration.
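The original training script is not included in this README. As a rough illustration of what Trainer-based masked-language-model training looks like, the sketch below continues MLM training of the released checkpoint on a plain-text corpus; the file name `nepali_corpus.txt` and all hyperparameters are assumptions, not the settings used for this model.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Hypothetical corpus file: one Nepali sentence or paragraph per line.
dataset = load_dataset("text", data_files={"train": "nepali_corpus.txt"})

tokenizer = AutoTokenizer.from_pretrained("Shushant/nepaliBERT")
model = AutoModelForMaskedLM.from_pretrained("Shushant/nepaliBERT")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# Dynamically mask 15% of tokens in each batch, as in standard BERT pretraining.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="nepalibert-mlm",
    per_device_train_batch_size=32,
    num_train_epochs=1,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```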
Data Description
The model was trained on about 4.6 GB of Nepali text collected from various sources, namely Nepali news sites and the OSCAR Nepali corpus.
Paper and Citation Details
If you are interested in the implementation details of this language model, see the paper cited below.
Plain Text
S. Pudasaini, S. Shakya, A. Tamang, S. Adhikari, S. Thapa and S. Lamichhane, "NepaliBERT: Pre-training of Masked Language Model in Nepali Corpus," 2023 7th International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC), Kirtipur, Nepal, 2023, pp. 325-330, doi: 10.1109/I-SMAC58438.2023.10290690.
BibTeX
```bibtex
@INPROCEEDINGS{10290690,
  author={Pudasaini, Shushanta and Shakya, Subarna and Tamang, Aakash and Adhikari, Sajjan and Thapa, Sunil and Lamichhane, Sagar},
  booktitle={2023 7th International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC)},
  title={NepaliBERT: Pre-training of Masked Language Model in Nepali Corpus},
  year={2023},
  volume={},
  number={},
  pages={325-330},
  doi={10.1109/I-SMAC58438.2023.10290690}
}
```
📄 License
This project is licensed under the MIT license.