# 🚀 RoBERTa trained on Sanskrit (SanBERTa)
SanBERTa is a RoBERTa model trained on Sanskrit text, which can be used for a range of Sanskrit NLP tasks.
## 🚀 Quick Start
The following sections will introduce the dataset, configuration, training, evaluation, and usage examples of SanBERTa.
## ✨ Features
- Trained on Sanskrit: Specifically designed for Sanskrit language processing.
- Multiple Usage Scenarios: Can be used for embedding generation and masked prediction tasks.
## 📦 Installation
No specific installation steps are provided in the original document; the usage examples below assume the `transformers` and `torch` Python packages are available.
## 💻 Usage Examples
### Basic Usage
#### For Embeddings
```python
from transformers import AutoTokenizer, RobertaModel

tokenizer = AutoTokenizer.from_pretrained("surajp/SanBERTa")
model = RobertaModel.from_pretrained("surajp/SanBERTa")

# "This language is considered the most ancient language not only of India but of the world."
op = tokenizer.encode("इयं भाषा न केवलं भारतस्य अपि तु विश्वस्य प्राचीनतमा भाषा इति मन्यते।", return_tensors="pt")
ps = model(op)

ps[0].shape
'''
Output:
--------
torch.Size([1, 47, 768])
'''
```
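If a single sentence vector is needed rather than per-token hidden states, one common option is mean pooling over the last hidden state. The snippet below is a minimal sketch built on the example above; the pooling choice is an assumption for illustration, not something prescribed by the original model card.

```python
import torch

# Mean-pool the token embeddings (last hidden state) into one sentence vector.
# Assumes `tokenizer` and `model` from the example above are already loaded.
enc = tokenizer("इयं भाषा न केवलं भारतस्य अपि तु विश्वस्य प्राचीनतमा भाषा इति मन्यते।", return_tensors="pt")
with torch.no_grad():
    out = model(**enc)

mask = enc["attention_mask"].unsqueeze(-1)            # [1, seq_len, 1]
sentence_vec = (out[0] * mask).sum(1) / mask.sum(1)   # [1, 768]
print(sentence_vec.shape)                             # torch.Size([1, 768])
```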
#### For Prediction
```python
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="surajp/SanBERTa",
    tokenizer="surajp/SanBERTa"
)

# Predict the token masked out of "केवलं" in the same sentence
fill_mask("इयं भाषा न केवल<mask> भारतस्य अपि तु विश्वस्य प्राचीनतमा भाषा इति मन्यते।")
'''
Output:
--------
[{'score': 0.7516744136810303,
  'sequence': '<s> इयं भाषा न केवलं भारतस्य अपि तु विश्वस्य प्राचीनतमा भाषा इति मन्यते।</s>',
  'token': 280,
  'token_str': 'à¤Ĥ'},
 {'score': 0.06230105459690094,
  'sequence': '<s> इयं भाषा न केवली भारतस्य अपि तु विश्वस्य प्राचीनतमा भाषा इति मन्यते।</s>',
  'token': 289,
  'token_str': 'à¥Ģ'},
 {'score': 0.055410224944353104,
  'sequence': '<s> इयं भाषा न केवला भारतस्य अपि तु विश्वस्य प्राचीनतमा भाषा इति मन्यते।</s>',
  'token': 265,
  'token_str': 'à¤¾'},
 ...]
'''
```
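The pipeline returns a list of candidate fills sorted by score, so the top entry can be used directly; `sequence` holds the sentence with `<mask>` filled in, while `token_str` shows the raw byte-level BPE piece (here the matras ं, ी and ा). A small usage sketch, with illustrative variable names:

```python
preds = fill_mask("इयं भाषा न केवल<mask> भारतस्य अपि तु विश्वस्य प्राचीनतमा भाषा इति मन्यते।")

best = preds[0]            # highest-scoring candidate
print(best["score"])       # e.g. ~0.75 for this sentence
print(best["sequence"])    # the sentence with <mask> filled in
```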
## 📚 Documentation
### Dataset
No dataset details are provided in the original document.
### Configuration

| Property            | Details |
|---------------------|---------|
| num_attention_heads | 12      |
| num_hidden_layers   | 6       |
| hidden_size         | 768     |
| vocab_size          | 29407   |
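These values can be read straight from the published configuration; a quick check (the expected output simply restates the table above):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("surajp/SanBERTa")
print(config.num_attention_heads, config.num_hidden_layers,
      config.hidden_size, config.vocab_size)
# Expected, per the table above: 12 6 768 29407
```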
### Training

- Hardware: trained on a TPU
- Task: language modelling (masked LM)
- Strategy: iteratively increasing `--block_size` from 128 to 256 over epochs (a staged-training sketch follows this list)
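The card does not include the training script itself. As one plausible way to reproduce the staged `block_size` schedule with the Hugging Face `Trainer`, the sketch below trains a 6-layer RoBERTa for masked LM in two stages; the corpus path, epoch count, and batch size are illustrative assumptions, not values from the original card.

```python
from transformers import (
    AutoTokenizer, RobertaConfig, RobertaForMaskedLM,
    TextDataset, DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("surajp/SanBERTa")
config = RobertaConfig(vocab_size=29407, num_hidden_layers=6,
                       num_attention_heads=12, hidden_size=768)
model = RobertaForMaskedLM(config)

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                            mlm=True, mlm_probability=0.15)

# Train in stages, raising block_size from 128 to 256 between stages.
for block_size in (128, 256):
    dataset = TextDataset(tokenizer=tokenizer,
                          file_path="sanskrit_corpus.txt",  # illustrative path
                          block_size=block_size)
    args = TrainingArguments(output_dir=f"./sanberta-bs{block_size}",
                             num_train_epochs=1,             # illustrative
                             per_device_train_batch_size=16)
    Trainer(model=model, args=args, data_collator=collator,
            train_dataset=dataset).train()
```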
### Evaluation

| Metric                        | Value |
|-------------------------------|-------|
| Perplexity (`block_size=256`) | 4.04  |
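Perplexity is the exponential of the average evaluation cross-entropy loss. A minimal check of that relationship (the loss value shown is simply the one implied by the reported perplexity, not a number from the card):

```python
import math

eval_loss = 1.396             # illustrative: exp(1.396) ≈ 4.04
perplexity = math.exp(eval_loss)
print(round(perplexity, 2))   # 4.04
```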
## 📄 License
No license information is provided in the original document.
## Citation
```bibtex
@misc{Parmar2020Sanberta,
  author    = {Parmar, Suraj},
  title     = {SanBERTa - a RoBERTa trained on Sanskrit},
  year      = {2020},
  month     = {Jun},
  publisher = {Hugging Face Model Hub},
  url       = {https://huggingface.co/surajp/SanBERTa}
}
```
Created by Suraj Parmar/@parmarsuraj99 | LinkedIn
Made with ♥ in India