# RoBERTa-hindi-guj-san
This multilingual RoBERTa-like model is trained on Wikipedia articles in Hindi, Sanskrit, and Gujarati. It aims to handle these related languages effectively, leveraging pre-training on Hindi to learn similar linguistic patterns in Sanskrit and Gujarati.
## Quick Start

### How to use
```python
from transformers import AutoTokenizer, AutoModelWithLMHead, pipeline

tokenizer = AutoTokenizer.from_pretrained("surajp/RoBERTa-hindi-guj-san")
model = AutoModelWithLMHead.from_pretrained("surajp/RoBERTa-hindi-guj-san")

fill_mask = pipeline(
    "fill-mask",
    model=model,
    tokenizer=tokenizer
)

fill_mask("ગુજરાતમાં ૧૯મી માર્ચ સુધી કોઈ સકારાત્મક (પોઝીટીવ) રીપોર્ટ આવ્યો <mask> હતો.")

'''
Output:
--------
[
  {'score': 0.07849744707345963, 'sequence': '<s> ગુજરાતમાં ૧૯મી માર્ચ સુધી કોઈ સકારાત્મક (પોઝીટીવ) રીપોર્ટ આવ્યો જ હતો.</s>', 'token': 390},
  {'score': 0.06273336708545685, 'sequence': '<s> ગુજરાતમાં ૧૯મી માર્ચ સુધી કોઈ સકારાત્મક (પોઝીટીવ) રીપોર્ટ આવ્યો ન હતો.</s>', 'token': 478},
  {'score': 0.05160355195403099, 'sequence': '<s> ગુજરાતમાં ૧૯મી માર્ચ સુધી કોઈ સકારાત્મક (પોઝીટીવ) રીપોર્ટ આવ્યો થઇ હતો.</s>', 'token': 2075},
  {'score': 0.04751499369740486, 'sequence': '<s> ગુજરાતમાં ૧૯મી માર્ચ સુધી કોઈ સકારાત્મક (પોઝીટીવ) રીપોર્ટ આવ્યો એક હતો.</s>', 'token': 600},
  {'score': 0.03788900747895241, 'sequence': '<s> ગુજરાતમાં ૧૯મી માર્ચ સુધી કોઈ સકારાત્મક (પોઝીટીવ) રીપોર્ટ આવ્યો પણ હતો.</s>', 'token': 840}
]
'''
```
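Note that `AutoModelWithLMHead` (used above, as in the original card) is deprecated in newer `transformers` releases. Assuming a recent version is installed, the masked-LM auto class should be a drop-in replacement for this checkpoint:

```python
from transformers import AutoModelForMaskedLM

# Non-deprecated equivalent of AutoModelWithLMHead for a masked-LM checkpoint.
model = AutoModelForMaskedLM.from_pretrained("surajp/RoBERTa-hindi-guj-san")
```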
## Features
- Multilingual support for Hindi, Sanskrit, and Gujarati.
- Leverages pre-training on Hindi to fine-tune on Sanskrit and Gujarati.
## Documentation

### Model description
A multilingual RoBERTa-like model trained on Wikipedia articles in Hindi, Sanskrit, and Gujarati. The tokenizer was trained on the combined text. The model was pre-trained on Hindi text and then fine-tuned on the combined Sanskrit and Gujarati text, with the expectation that pre-training on Hindi would help the model learn the related languages.
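The card only states that the tokenizer was trained on the combined text; the exact training code is not included. A minimal sketch of how such a byte-level BPE tokenizer could be trained with the `tokenizers` library (file names and special tokens are assumptions, not taken from the original setup):

```python
from tokenizers import ByteLevelBPETokenizer

# Hypothetical paths to the cleaned Hindi, Sanskrit and Gujarati Wikipedia text.
files = ["hindi.txt", "sanskrit.txt", "gujarati.txt"]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=files,
    vocab_size=30522,  # matches vocab_size in the configuration below
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model(".")  # writes vocab.json and merges.txt to the current directory
```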
### Configuration

| Property | Details |
|----------|---------|
| hidden_size | 768 |
| num_attention_heads | 12 |
| num_hidden_layers | 6 |
| vocab_size | 30522 |
| model_type | roberta |
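For reference, the same architecture can be expressed programmatically; this is a sketch assuming every field not listed in the table keeps its RoBERTa default:

```python
from transformers import RobertaConfig, RobertaForMaskedLM

# Configuration matching the table above; unlisted fields keep RoBERTa defaults.
config = RobertaConfig(
    vocab_size=30522,
    hidden_size=768,
    num_attention_heads=12,
    num_hidden_layers=6,
)
model = RobertaForMaskedLM(config=config)  # randomly initialised model of this size
```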
## Installation
No specific installation steps are provided in the original document.
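In practice, the model is loaded through the Hugging Face Transformers library, so a standard installation (for example `pip install transformers`, plus a backend such as PyTorch) should be sufficient to run the Quick Start example above.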
## License
This project is licensed under the MIT license.
## Technical Details

### Training data
The model is trained on cleaned Wikipedia articles in Hindi, Sanskrit, and Gujarati hosted on Kaggle. The dataset contains both training and evaluation text and is also used in iNLTK.
### Training procedure
- The model is trained on TPU (using `xla_spawn.py`).
- It is trained for language modelling.
- `--block_size` is iteratively increased from 128 to 256 over epochs (a sketch of this schedule follows the command excerpt below).
- The tokenizer is trained on the combined text.
- The model is pre-trained with Hindi and fine-tuned on Sanskrit and Gujarati texts.

Flags used for training (excerpt):
```bash
--model_type distillroberta-base \
--model_name_or_path "/content/SanHiGujBERTa" \
--mlm_probability 0.20 \
--line_by_line \
--save_total_limit 2 \
--per_device_train_batch_size 128 \
--per_device_eval_batch_size 128 \
--num_train_epochs 5 \
--block_size 256 \
--seed 108 \
--overwrite_output_dir \
```
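The card does not include the author's full launch script. Below is a minimal sketch (an assumption, not the original code) of the iterative `--block_size` schedule described above: the same language-modelling run is relaunched with a growing block size, each pass continuing from the previous output. Script locations, the data path, and the intermediate step of 192 are hypothetical; the remaining flags are copied from the excerpt above.

```python
import subprocess

MODEL_DIR = "/content/SanHiGujBERTa"  # checkpoint directory from the command excerpt above

for block_size in (128, 192, 256):
    subprocess.run(
        [
            "python", "xla_spawn.py", "--num_cores", "8",  # TPU launcher from the transformers examples
            "run_language_modeling.py",
            "--model_type", "distillroberta-base",
            "--model_name_or_path", MODEL_DIR,
            "--mlm",
            "--mlm_probability", "0.20",
            "--line_by_line",
            "--save_total_limit", "2",
            "--per_device_train_batch_size", "128",
            "--per_device_eval_batch_size", "128",
            "--num_train_epochs", "5",
            "--block_size", str(block_size),
            "--seed", "108",
            "--overwrite_output_dir",
            "--do_train",
            "--train_data_file", "train.txt",  # hypothetical path to the training text
            "--output_dir", MODEL_DIR,  # write back so the next pass continues from it
        ],
        check=True,
    )
```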
### Eval results
The perplexity of the model is 2.920005983224673.
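The reported perplexity is the exponential of the evaluation cross-entropy loss computed by the language-modelling script; a quick sanity check of that relationship:

```python
import math

# perplexity = exp(eval_loss); the reported value corresponds to a loss of ~1.0716.
eval_loss = math.log(2.920005983224673)
print(eval_loss)            # ~1.0716
print(math.exp(eval_loss))  # 2.920005983224673
```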
Created by Suraj Parmar/@parmarsuraj99 | LinkedIn
Made with ♥ in India