🚀 distilbert-base-nepali
This model is pre-trained on the nepalitext dataset, which contains over 13 million Nepali text sequences, using a masked language modeling (MLM) objective. We train a SentencePiece Model (SPM) for text tokenization, similar to XLM-RoBERTa, and then train the DistilBERT model for language modeling. More details can be found in this paper.
🚀 Quick Start
This model can be used directly with a pipeline for masked language modeling:
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='Sakonii/distilbert-base-nepali')
>>> unmasker("मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, <mask>, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।")
[{'score': 0.04128897562623024,
  'sequence': 'मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, मौसम, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।',
  'token': 2605,
  'token_str': 'मौसम'},
 {'score': 0.04100276157259941,
  'sequence': 'मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, प्रकृति, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।',
  'token': 2792,
  'token_str': 'प्रकृति'},
 {'score': 0.026525357738137245,
  'sequence': 'मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, पानी, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।',
  'token': 387,
  'token_str': 'पानी'},
 {'score': 0.02340106852352619,
  'sequence': 'मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, जल, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।',
  'token': 1313,
  'token_str': 'जल'},
 {'score': 0.02055591531097889,
  'sequence': 'मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, वातावरण, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।',
  'token': 790,
  'token_str': 'वातावरण'}]
Here is how we can use the model to get the features of a given text in PyTorch:
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained('Sakonii/distilbert-base-nepali')
model = AutoModelForMaskedLM.from_pretrained('Sakonii/distilbert-base-nepali')
# prepare input; the placeholder string below means "put the desired text here"
text = "चाहिएको text यता राख्नु होला।"
encoded_input = tokenizer(text, return_tensors='pt')
# forward pass
output = model(**encoded_input)
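Note that the masked-LM head above returns vocabulary logits rather than sentence features. As a minimal sketch (not taken from the model card, assuming the standard transformers AutoModel API), the bare encoder can be loaded instead to obtain hidden-state features:

from transformers import AutoTokenizer, AutoModel

# Load the bare DistilBERT encoder (no MLM head) to extract hidden-state features
tokenizer = AutoTokenizer.from_pretrained('Sakonii/distilbert-base-nepali')
model = AutoModel.from_pretrained('Sakonii/distilbert-base-nepali')

text = "चाहिएको text यता राख्नु होला।"
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)

# last_hidden_state has shape (batch_size, sequence_length, hidden_size)
features = output.last_hidden_state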
✨ Features
- Pre-trained on a Large Nepali Dataset: It is pre-trained on a dataset with over 13 million Nepali text sequences, enabling it to capture rich Nepali language features.
- Masked Language Modeling: Trained with a masked language modeling (MLM) objective, which helps the model understand the context and semantics of Nepali text (a minimal masking sketch follows below).
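To illustrate the MLM objective, here is a minimal sketch assuming the standard DataCollatorForLanguageModeling from transformers (the card does not specify the exact data pipeline): roughly 15% of the tokens are replaced by the mask token and the model learns to predict them from the surrounding context.

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained('Sakonii/distilbert-base-nepali')
# Randomly mask ~15% of tokens; the model is trained to predict the masked positions
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

batch = collator([tokenizer("चाहिएको text यता राख्नु होला।")])
print(batch['input_ids'])  # some ids replaced by tokenizer.mask_token_id
print(batch['labels'])     # -100 everywhere except at the masked positions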
📚 Documentation
Model description
Refer to the original distilbert-base-uncased model.
Intended uses & limitations
This backbone model is intended to be fine-tuned on Nepali-language-focused downstream tasks such as sequence classification, token classification or question answering. Because the language model was trained on data with texts grouped into blocks of 512 tokens, it handles text sequences of up to 512 tokens and may not perform satisfactorily on shorter sequences.
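As a minimal fine-tuning sketch (assuming the standard transformers API; the num_labels value and the input text are purely illustrative), the backbone can be loaded with a fresh task head, for example for sequence classification:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the pre-trained backbone with a newly initialized classification head.
# num_labels=2 is a hypothetical example; set it to match your own dataset.
tokenizer = AutoTokenizer.from_pretrained('Sakonii/distilbert-base-nepali')
model = AutoModelForSequenceClassification.from_pretrained(
    'Sakonii/distilbert-base-nepali', num_labels=2
)

# The backbone handles sequences of up to 512 tokens, so truncate accordingly.
inputs = tokenizer("चाहिएको text यता राख्नु होला।", truncation=True, max_length=512, return_tensors='pt')
logits = model(**inputs).logits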
Training data
This model is trained on the nepalitext language modeling dataset, which combines the following datasets: OSCAR, cc100 and a set of scraped Nepali articles from Wikipedia. For training the language model, the texts in the training set are grouped into blocks of 512 tokens.
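The grouping into 512-token blocks can be done, for instance, with the group_texts helper used in the Hugging Face language-modeling examples. The sketch below is an assumption about the preprocessing, not code from the card; only block_size = 512 is stated above.

block_size = 512  # block size stated above

def group_texts(examples):
    # Concatenate all tokenized texts, then split the result into fixed-size
    # blocks, dropping the small remainder at the end.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = (len(concatenated['input_ids']) // block_size) * block_size
    return {
        k: [t[i:i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }

# Typically applied to a tokenized datasets.Dataset via dataset.map(group_texts, batched=True)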
Tokenization
A SentencePiece Model (SPM) is trained on a subset of the nepalitext dataset for text tokenization. The tokenizer is trained with vocab_size=24576, min_frequency=4, limit_alphabet=1000 and model_max_length=512.
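As a rough sketch of such a tokenizer-training run (an assumption: the card does not say which implementation was used or whether the SentencePiece model is BPE- or unigram-based, and nepalitext.txt is a placeholder corpus file), the Hugging Face tokenizers library exposes matching parameters:

from tokenizers import SentencePieceBPETokenizer

# Placeholder corpus file containing a plain-text subset of nepalitext
tokenizer = SentencePieceBPETokenizer()
tokenizer.train(
    files=['nepalitext.txt'],
    vocab_size=24576,
    min_frequency=4,
    limit_alphabet=1000,
    special_tokens=['<s>', '</s>', '<unk>', '<pad>', '<mask>'],
)
tokenizer.save('nepali-spm-tokenizer.json')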
Training procedure
The model is trained with the same configuration as the original distilbert-base-uncased: 512 tokens per instance, 28 instances per batch, and around 35.7K training steps.
Training hyperparameters
The following hyperparameters were used for training the final epoch (refer to the Training results table below for the hyperparameters that varied across epochs; a sketch mapping these onto TrainingArguments follows the list):
- learning_rate: 5e-05
- train_batch_size: 28
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 1
- mixed_precision_training: Native AMP
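As a minimal sketch, assuming the transformers Trainer API was used (the output_dir is a placeholder), the hyperparameters above map onto TrainingArguments roughly as follows:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='distilbert-base-nepali-mlm',  # placeholder output directory
    learning_rate=5e-05,
    per_device_train_batch_size=28,
    per_device_eval_batch_size=8,
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-08,
    lr_scheduler_type='linear',
    num_train_epochs=1,
    fp16=True,  # Native AMP mixed-precision training
)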
Training results
The model is trained over multiple epochs with varying hyperparameters:
Training Loss | Epoch | MLM Probability | Train Batch Size | Step | Validation Loss | Perplexity |
---|---|---|---|---|---|---|
3.4477 | 1.0 | 15 | 26 | 38864 | 3.3067 | 27.2949 |
2.9451 | 2.0 | 15 | 28 | 35715 | 2.8238 | 16.8407 |
2.866 | 3.0 | 20 | 28 | 35715 | 2.7431 | 15.5351 |
2.7287 | 4.0 | 20 | 28 | 35715 | 2.6053 | 13.5353 |
2.6412 | 5.0 | 20 | 28 | 35715 | 2.5161 | 12.3802 |
Final model evaluated with MLM Probability of 15%:
Training Loss | Epoch | MLM Probability | Train Batch Size | Step | Validation Loss | Perplexity |
---|---|---|---|---|---|---|
- | - | 15 | - | - | 2.3494 | 10.4791 |
Framework versions
- Transformers 4.16.2
- PyTorch 1.9.1
- Datasets 1.18.3
- Tokenizers 0.10.3
🔧 Technical Details
Evaluation Results
It achieves the following results on the evaluation set:
mlm probability | evaluation loss | evaluation perplexity |
---|---|---|
15% | 2.349 | 10.479 |
20% | 2.605 | 13.351 |
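Perplexity here is simply the exponential of the evaluation (cross-entropy) loss, which can be verified directly, e.g. for the 15% MLM-probability evaluation:

import math

# exp(evaluation loss) gives the reported evaluation perplexity
print(math.exp(2.349))  # ~10.48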
📄 License
This model is licensed under the Apache 2.0 license.

