Gujarati-XLM-R-Base
This model is fine-tuned for Gujarati on top of the XLM-RoBERTa base model, leveraging transfer learning and the OSCAR monolingual dataset to provide useful representations for Gujarati NLP tasks.
Quick Start
This model fine-tunes the base variant of XLM-RoBERTa (XLM-R) on Gujarati text from the OSCAR monolingual dataset. We used the same masked language modelling (MLM) objective that was used to pretrain XLM-R. Because it is built on top of the pretrained XLM-R checkpoint, it leverages transfer learning by exploiting the knowledge of its parent model.
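As an illustration only, the sketch below shows how this kind of continued MLM pretraining can be set up with the Hugging Face `Trainer`; the training file `gujarati_oscar.txt` and all hyperparameters are placeholders rather than the exact settings used for this model.

```python
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, LineByLineTextDataset,
                          Trainer, TrainingArguments)

# Start from the pretrained XLM-R base checkpoint (transfer learning).
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

# Placeholder file with one Gujarati sentence per line (e.g. drawn from OSCAR).
dataset = LineByLineTextDataset(tokenizer=tokenizer,
                                file_path="gujarati_oscar.txt",
                                block_size=512)

# MLM objective: randomly mask 15% of tokens, as in the original XLM-R pretraining.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                                mlm=True, mlm_probability=0.15)

# Illustrative hyperparameters only.
training_args = TrainingArguments(output_dir="gujarati-xlm-r-base",
                                  num_train_epochs=3,
                                  per_device_train_batch_size=8,
                                  save_steps=10_000)

Trainer(model=model, args=training_args, data_collator=data_collator,
        train_dataset=dataset).train()
```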
Features
- This model can be used for further fine-tuning on downstream NLP tasks in Gujarati (see the sketch after this list).
- It can be used to generate contextualised word representations for Gujarati words.
- It can be used for domain adaptation.
- It can be used to predict missing words in Gujarati sentences.
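As one example of such fine-tuning, here is a minimal sketch that attaches a classification head to this checkpoint; the two-class task and `num_labels=2` are purely illustrative assumptions.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("ashwani-tanwar/Gujarati-XLM-R-Base")

# Loads the pretrained encoder and adds a randomly initialised classification head;
# num_labels=2 is an illustrative choice for a hypothetical binary Gujarati task.
model = AutoModelForSequenceClassification.from_pretrained(
    "ashwani-tanwar/Gujarati-XLM-R-Base", num_labels=2
)

# The model can now be fine-tuned on a labelled Gujarati dataset
# (e.g. with the Hugging Face Trainer) like any other sequence classifier.
```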
Installation
Please visit this link for the detailed installation and preprocessing procedure.
Usage Examples
Basic Usage
Using the model to predict missing words
from transformers import pipeline

unmasker = pipeline('fill-mask', model='ashwani-tanwar/Gujarati-XLM-R-Base')
# "Ahmedabad is a <mask> of Gujarat."
pred_word = unmasker("અમદાવાદ એ ગુજરાતનું એક <mask> છે.")
print(pred_word)
The output will look like this:
[{'sequence': '<s> અમદાવાદ એ ગુજરાતનું એક શહેર છે.</s>', 'score': 0.9463568329811096, 'token': 85227, 'token_str': '▁શહેર'},
 {'sequence': '<s> અમદાવાદ એ ગુજરાતનું એક ગામ છે.</s>', 'score': 0.013311690650880337, 'token': 66346, 'token_str': '▁ગામ'},
 {'sequence': '<s> અમદાવાદ એ ગુજરાતનું એકનગર છે.</s>', 'score': 0.012945962138473988, 'token': 69702, 'token_str': 'નગર'},
 {'sequence': '<s> અમદાવાદ એ ગુજરાતનું એક સ્થળ છે.</s>', 'score': 0.0045941537246108055, 'token': 135436, 'token_str': '▁સ્થળ'},
 {'sequence': '<s> અમદાવાદ એ ગુજરાતનું એક મહત્વ છે.</s>', 'score': 0.00402021361514926, 'token': 126763, 'token_str': '▁મહત્વ'}]
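Since the pipeline returns an ordinary list of dictionaries, the top prediction can be read off directly; a small follow-up snippet, assuming `pred_word` from the example above:

```python
top = pred_word[0]                     # highest-scoring candidate
print(top['token_str'], top['score'])  # e.g. '▁શહેર' ("city") with probability ~0.95
```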
Using the model to generate contextualised word representations
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("ashwani-tanwar/Gujarati-XLM-R-Base")
model = AutoModel.from_pretrained("ashwani-tanwar/Gujarati-XLM-R-Base")
# "Ahmedabad is a city of Gujarat."
sentence = "અમદાવાદ એ ગુજરાતનું એક શહેર છે."
encoded_sentence = tokenizer(sentence, return_tensors='pt')
context_word_rep = model(**encoded_sentence)
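The returned object holds the encoder outputs; the short sketch below shows one way to pull out the per-token vectors and a simple mean-pooled sentence vector (mean pooling is an illustrative choice, not something prescribed by this model).

```python
# Last-layer hidden states: one 768-dimensional vector per sub-word token.
token_embeddings = context_word_rep.last_hidden_state  # shape: (1, num_tokens, 768)

# One simple way to obtain a single sentence vector is to average over tokens.
sentence_embedding = token_embeddings.mean(dim=1)       # shape: (1, 768)
```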
Documentation
Dataset
The OSCAR corpus contains diverse monolingual datasets for many languages. We followed the work of CamemBERT, whose authors reported better performance with this diverse dataset than with other large homogeneous datasets.
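If you want to inspect the same kind of data, the Gujarati portion of OSCAR can be loaded with the Hugging Face `datasets` library; a minimal sketch, assuming the deduplicated Gujarati configuration (the exact OSCAR variant used for this model is not specified here):

```python
from datasets import load_dataset

# Deduplicated Gujarati ("gu") split of the OSCAR corpus.
oscar_gu = load_dataset("oscar", "unshuffled_deduplicated_gu", split="train")
print(oscar_gu[0]["text"][:200])
```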
Preprocessing and Training Procedure
Please visit this link for the detailed procedure.