🚀 distilbert-base-nepali
This model is pre-trained on the nepalitext dataset, which contains over 13 million Nepali text sequences, using a masked language modeling (MLM) objective. We train a SentencePiece Model (SPM) for text tokenization, similar to XLM-RoBERTa, and then train the DistilBERT model for language modeling. More details can be found in this paper.
🚀 Quick Start
This model can be used directly with a pipeline for masked language modeling:
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='Sakonii/distilbert-base-nepali')
>>> unmasker("मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, <mask>, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।")
[{'score': 0.04128897562623024,
  'sequence': 'मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, मौसम, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।',
  'token': 2605,
  'token_str': 'मौसम'},
 {'score': 0.04100276157259941,
  'sequence': 'मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, प्रकृति, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।',
  'token': 2792,
  'token_str': 'प्रकृति'},
 {'score': 0.026525357738137245,
  'sequence': 'मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, पानी, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।',
  'token': 387,
  'token_str': 'पानी'},
 {'score': 0.02340106852352619,
  'sequence': 'मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, जल, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।',
  'token': 1313,
  'token_str': 'जल'},
 {'score': 0.02055591531097889,
  'sequence': 'मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, वातावरण, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।',
  'token': 790,
  'token_str': 'वातावरण'}]
Here is how we can use the model to get the features of a given text in PyTorch:
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained('Sakonii/distilbert-base-nepali')
model = AutoModelForMaskedLM.from_pretrained('Sakonii/distilbert-base-nepali')
# prepare input; the placeholder string below means "put the desired text here"
text = "चाहिएको text यता राख्नु होला।"
encoded_input = tokenizer(text, return_tensors='pt')
# forward pass
output = model(**encoded_input)
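Note that the masked-LM head above returns vocabulary logits rather than sentence features. As a minimal sketch (not taken from the model card, assuming the standard transformers AutoModel API), the bare encoder can be loaded instead to obtain hidden-state features:

from transformers import AutoTokenizer, AutoModel

# Load the bare DistilBERT encoder (no MLM head) to extract hidden-state features
tokenizer = AutoTokenizer.from_pretrained('Sakonii/distilbert-base-nepali')
model = AutoModel.from_pretrained('Sakonii/distilbert-base-nepali')

text = "चाहिएको text यता राख्नु होला।"
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)

# last_hidden_state has shape (batch_size, sequence_length, hidden_size)
features = output.last_hidden_state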
✨ Features
- Pre-trained on a Large Nepali Dataset: It is pre-trained on a dataset with over 13 million Nepali text sequences, enabling it to capture rich Nepali language features.
- Masked Language Modeling: Trained with a masked language modeling (MLM) objective, which helps the model understand the context and semantics of Nepali text (a minimal masking sketch follows below).
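To illustrate the MLM objective, here is a minimal sketch assuming the standard DataCollatorForLanguageModeling from transformers (the card does not specify the exact data pipeline): roughly 15% of the tokens are replaced by the mask token and the model learns to predict them from the surrounding context.

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained('Sakonii/distilbert-base-nepali')
# Randomly mask ~15% of tokens; the model is trained to predict the masked positions
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

batch = collator([tokenizer("चाहिएको text यता राख्नु होला।")])
print(batch['input_ids'])  # some ids replaced by tokenizer.mask_token_id
print(batch['labels'])     # -100 everywhere except at the masked positions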
📚 Documentation
Model description
Refer to the original distilbert-base-uncased model.
Intended uses & limitations
This backbone model is intended to be fine-tuned on Nepali-language-focused downstream tasks such as sequence classification, token classification or question answering. Because the language model was trained on data with texts grouped into blocks of 512 tokens, it handles text sequences of up to 512 tokens and may not perform satisfactorily on shorter sequences.
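As a minimal fine-tuning sketch (assuming the standard transformers API; the num_labels value and the input text are purely illustrative), the backbone can be loaded with a fresh task head, for example for sequence classification:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the pre-trained backbone with a newly initialized classification head.
# num_labels=2 is a hypothetical example; set it to match your own dataset.
tokenizer = AutoTokenizer.from_pretrained('Sakonii/distilbert-base-nepali')
model = AutoModelForSequenceClassification.from_pretrained(
    'Sakonii/distilbert-base-nepali', num_labels=2
)

# The backbone handles sequences of up to 512 tokens, so truncate accordingly.
inputs = tokenizer("चाहिएको text यता राख्नु होला।", truncation=True, max_length=512, return_tensors='pt')
logits = model(**inputs).logits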
Training data
This model is trained on the nepalitext language modeling dataset, which combines the following datasets: OSCAR, cc100 and a set of scraped Nepali articles from Wikipedia. For training the language model, the texts in the training set are grouped into blocks of 512 tokens.
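The grouping into 512-token blocks can be done, for instance, with the group_texts helper used in the Hugging Face language-modeling examples. The sketch below is an assumption about the preprocessing, not code from the card; only block_size = 512 is stated above.

block_size = 512  # block size stated above

def group_texts(examples):
    # Concatenate all tokenized texts, then split the result into fixed-size
    # blocks, dropping the small remainder at the end.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = (len(concatenated['input_ids']) // block_size) * block_size
    return {
        k: [t[i:i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }

# Typically applied to a tokenized datasets.Dataset via dataset.map(group_texts, batched=True)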
Tokenization
A SentencePiece Model (SPM) is trained on a subset of the nepalitext dataset for text tokenization. The tokenizer is trained with vocab_size=24576, min_frequency=4, limit_alphabet=1000 and model_max_length=512.
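As a rough sketch of such a tokenizer-training run (an assumption: the card does not say which implementation was used or whether the SentencePiece model is BPE- or unigram-based, and nepalitext.txt is a placeholder corpus file), the Hugging Face tokenizers library exposes matching parameters:

from tokenizers import SentencePieceBPETokenizer

# Placeholder corpus file containing a plain-text subset of nepalitext
tokenizer = SentencePieceBPETokenizer()
tokenizer.train(
    files=['nepalitext.txt'],
    vocab_size=24576,
    min_frequency=4,
    limit_alphabet=1000,
    special_tokens=['<s>', '</s>', '<unk>', '<pad>', '<mask>'],
)
tokenizer.save('nepali-spm-tokenizer.json')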
Training procedure
The model is trained with the same configuration as the original distilbert-base-uncased: 512 tokens per instance, 28 instances per batch, and around 35.7K training steps.
Training hyperparameters
The following hyperparameters were used for training the final epoch (refer to the Training results table below for the hyperparameters that varied across epochs; a sketch mapping these onto TrainingArguments follows the list):
- learning_rate: 5e-05
- train_batch_size: 28
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 1
- mixed_precision_training: Native AMP
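As a minimal sketch, assuming the transformers Trainer API was used (the output_dir is a placeholder), the hyperparameters above map onto TrainingArguments roughly as follows:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='distilbert-base-nepali-mlm',  # placeholder output directory
    learning_rate=5e-05,
    per_device_train_batch_size=28,
    per_device_eval_batch_size=8,
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-08,
    lr_scheduler_type='linear',
    num_train_epochs=1,
    fp16=True,  # Native AMP mixed-precision training
)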
Training results
The model is trained over multiple epochs with varying hyperparameters:
Training Loss | Epoch | MLM Probability | Train Batch Size | Step | Validation Loss | Perplexity |
---|---|---|---|---|---|---|
3.4477 | 1.0 | 15 | 26 | 38864 | 3.3067 | 27.2949 |
2.9451 | 2.0 | 15 | 28 | 35715 | 2.8238 | 16.8407 |
2.866 | 3.0 | 20 | 28 | 35715 | 2.7431 | 15.5351 |
2.7287 | 4.0 | 20 | 28 | 35715 | 2.6053 | 13.5353 |
2.6412 | 5.0 | 20 | 28 | 35715 | 2.5161 | 12.3802 |
Final model evaluated with MLM Probability of 15%:
Training Loss | Epoch | MLM Probability | Train Batch Size | Step | Validation Loss | Perplexity |
---|---|---|---|---|---|---|
- | - | 15 | - | - | 2.3494 | 10.4791 |
Framework versions
- Transformers 4.16.2
- PyTorch 1.9.1
- Datasets 1.18.3
- Tokenizers 0.10.3
🔧 Technical Details
Evaluation Results
It achieves the following results on the evaluation set:
mlm probability | evaluation loss | evaluation perplexity |
---|---|---|
15% | 2.349 | 10.479 |
20% | 2.605 | 13.351 |
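Perplexity here is simply the exponential of the evaluation (cross-entropy) loss, which can be verified directly, e.g. for the 15% MLM-probability evaluation:

import math

# exp(evaluation loss) gives the reported evaluation perplexity
print(math.exp(2.349))  # ~10.48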
📄 License
This model is licensed under the Apache 2.0 license.

