Gujarati-XLM-R-Base
This model is fine-tuned for Gujarati on top of the XLM-RoBERTa base model, leveraging transfer learning and the OSCAR monolingual dataset to provide useful representations for Gujarati NLP tasks.
Quick Start
This model fine-tunes the base variant of XLM-RoBERTa (XLM-R) on Gujarati text from the OSCAR monolingual dataset. We used the same masked language modelling (MLM) objective that was used to pretrain XLM-R. Because it is built on top of the pretrained XLM-R checkpoint, it leverages transfer learning by exploiting the knowledge of its parent model.
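As an illustration only, the sketch below shows how this kind of continued MLM pretraining can be set up with the Hugging Face `Trainer`; the training file `gujarati_oscar.txt` and all hyperparameters are placeholders rather than the exact settings used for this model.

```python
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, LineByLineTextDataset,
                          Trainer, TrainingArguments)

# Start from the pretrained XLM-R base checkpoint (transfer learning).
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

# Placeholder file with one Gujarati sentence per line (e.g. drawn from OSCAR).
dataset = LineByLineTextDataset(tokenizer=tokenizer,
                                file_path="gujarati_oscar.txt",
                                block_size=512)

# MLM objective: randomly mask 15% of tokens, as in the original XLM-R pretraining.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                                mlm=True, mlm_probability=0.15)

# Illustrative hyperparameters only.
training_args = TrainingArguments(output_dir="gujarati-xlm-r-base",
                                  num_train_epochs=3,
                                  per_device_train_batch_size=8,
                                  save_steps=10_000)

Trainer(model=model, args=training_args, data_collator=data_collator,
        train_dataset=dataset).train()
```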
Features
- This model can be used for further fine-tuning on downstream NLP tasks in Gujarati (see the sketch after this list).
- It can be used to generate contextualised word representations for Gujarati words.
- It can be used for domain adaptation.
- It can be used to predict missing words in Gujarati sentences.
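As one example of such fine-tuning, here is a minimal sketch that attaches a classification head to this checkpoint; the two-class task and `num_labels=2` are purely illustrative assumptions.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("ashwani-tanwar/Gujarati-XLM-R-Base")

# Loads the pretrained encoder and adds a randomly initialised classification head;
# num_labels=2 is an illustrative choice for a hypothetical binary Gujarati task.
model = AutoModelForSequenceClassification.from_pretrained(
    "ashwani-tanwar/Gujarati-XLM-R-Base", num_labels=2
)

# The model can now be fine-tuned on a labelled Gujarati dataset
# (e.g. with the Hugging Face Trainer) like any other sequence classifier.
```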
Installation
Please visit this link for the detailed installation and preprocessing procedure.
Usage Examples
Basic Usage
Using the model to predict missing words
from transformers import pipeline

unmasker = pipeline('fill-mask', model='ashwani-tanwar/Gujarati-XLM-R-Base')
# "Ahmedabad is a <mask> of Gujarat."
pred_word = unmasker("અમદાવાદ એ ગુજરાતનું એક <mask> છે.")
print(pred_word)
The output will look like this:
[{'sequence': '<s> અમદાવાદ એ ગુજરાતનું એક શહેર છે.</s>', 'score': 0.9463568329811096, 'token': 85227, 'token_str': '▁શહેર'},
 {'sequence': '<s> અમદાવાદ એ ગુજરાતનું એક ગામ છે.</s>', 'score': 0.013311690650880337, 'token': 66346, 'token_str': '▁ગામ'},
 {'sequence': '<s> અમદાવાદ એ ગુજરાતનું એકનગર છે.</s>', 'score': 0.012945962138473988, 'token': 69702, 'token_str': 'નગર'},
 {'sequence': '<s> અમદાવાદ એ ગુજરાતનું એક સ્થળ છે.</s>', 'score': 0.0045941537246108055, 'token': 135436, 'token_str': '▁સ્થળ'},
 {'sequence': '<s> અમદાવાદ એ ગુજરાતનું એક મહત્વ છે.</s>', 'score': 0.00402021361514926, 'token': 126763, 'token_str': '▁મહત્વ'}]
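Since the pipeline returns an ordinary list of dictionaries, the top prediction can be read off directly; a small follow-up snippet, assuming `pred_word` from the example above:

```python
top = pred_word[0]                     # highest-scoring candidate
print(top['token_str'], top['score'])  # e.g. '▁શહેર' ("city") with probability ~0.95
```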
Using the model to generate contextualised word representations
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("ashwani-tanwar/Gujarati-XLM-R-Base")
model = AutoModel.from_pretrained("ashwani-tanwar/Gujarati-XLM-R-Base")
# "Ahmedabad is a city of Gujarat."
sentence = "અમદાવાદ એ ગુજરાતનું એક શહેર છે."
encoded_sentence = tokenizer(sentence, return_tensors='pt')
context_word_rep = model(**encoded_sentence)
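The returned object holds the encoder outputs; the short sketch below shows one way to pull out the per-token vectors and a simple mean-pooled sentence vector (mean pooling is an illustrative choice, not something prescribed by this model).

```python
# Last-layer hidden states: one 768-dimensional vector per sub-word token.
token_embeddings = context_word_rep.last_hidden_state  # shape: (1, num_tokens, 768)

# One simple way to obtain a single sentence vector is to average over tokens.
sentence_embedding = token_embeddings.mean(dim=1)       # shape: (1, 768)
```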
Documentation
Dataset
The OSCAR corpus contains diverse monolingual datasets for many languages. We followed the work of CamemBERT, whose authors reported better performance with this diverse dataset than with other large homogeneous datasets.
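If you want to inspect the same kind of data, the Gujarati portion of OSCAR can be loaded with the Hugging Face `datasets` library; a minimal sketch, assuming the deduplicated Gujarati configuration (the exact OSCAR variant used for this model is not specified here):

```python
from datasets import load_dataset

# Deduplicated Gujarati ("gu") split of the OSCAR corpus.
oscar_gu = load_dataset("oscar", "unshuffled_deduplicated_gu", split="train")
print(oscar_gu[0]["text"][:200])
```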
Preprocessing and Training Procedure
Please visit this link for the detailed procedure.