🚀 RoBERTa base model for Marathi language
This is a model pre-trained on the Marathi language with a masked language modeling (MLM) objective. RoBERTa was introduced in this paper and first released in this repository. We trained the RoBERTa model for Marathi during the JAX/Flax for NLP & CV community week hosted by Hugging Face 🤗.

✨ Features
- Marathi RoBERTa is a transformers model pretrained on a large corpus of Marathi data in a self-supervised fashion.
- It can be used for masked language modeling and is suitable for fine-tuning on downstream tasks such as sequence classification, token classification, or question answering.
🚀 Quick Start
You can use this model directly with a pipeline for masked language modeling:
```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='flax-community/roberta-base-mr')
>>> unmasker("मोठी बातमी! उद्या दुपारी <mask> वाजता जाहीर होणार दहावीचा निकाल")
[{'score': 0.057209037244319916,
  'sequence': 'मोठी बातमी! उद्या दुपारी आठ वाजता जाहीर होणार दहावीचा निकाल',
  'token': 2226,
  'token_str': 'आठ'},
 {'score': 0.02796074189245701,
  'sequence': 'मोठी बातमी! उद्या दुपारी २० वाजता जाहीर होणार दहावीचा निकाल',
  'token': 987,
  'token_str': '२०'},
 {'score': 0.017235398292541504,
  'sequence': 'मोठी बातमी! उद्या दुपारी नऊ वाजता जाहीर होणार दहावीचा निकाल',
  'token': 4080,
  'token_str': 'नऊ'},
 {'score': 0.01691395975649357,
  'sequence': 'मोठी बातमी! उद्या दुपारी २१ वाजता जाहीर होणार दहावीचा निकाल',
  'token': 1944,
  'token_str': '२१'},
 {'score': 0.016252165660262108,
  'sequence': 'मोठी बातमी! उद्या दुपारी ३ वाजता जाहीर होणार दहावीचा निकाल',
  'token': 549,
  'token_str': ' ३'}]
```
📦 Installation
No project-specific installation steps are required: install the 🤗 Transformers library (plus a backend such as PyTorch or Flax) and load the checkpoint directly from the Hugging Face Hub.
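For example, assuming 🤗 Transformers and a PyTorch backend are installed, the tokenizer and model load straight from the Hub (a minimal sketch; pass `from_flax=True` if only Flax weights are available):

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Downloads the checkpoint from the Hugging Face Hub on first use.
tokenizer = AutoTokenizer.from_pretrained("flax-community/roberta-base-mr")
model = AutoModelForMaskedLM.from_pretrained("flax-community/roberta-base-mr")
```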
📚 Documentation
Intended uses & limitations ❗️
You can use the raw model for masked language modeling, but it is mostly intended to be fine-tuned on a downstream task. Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification, or question answering. Since the Marathi mC4 dataset was built by scraping text from Marathi newspapers, it carries biases from that source, which will also affect all fine-tuned versions of this model.
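As a rough sketch of such fine-tuning (not the exact training setup used for the classifiers reported below), a sequence classification head can be attached on top of the pretrained encoder; `num_labels=3` here simply mirrors the 3-class news datasets described later:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("flax-community/roberta-base-mr")

# The encoder weights come from the pretrained checkpoint; the
# classification head is freshly initialized and must be fine-tuned.
model = AutoModelForSequenceClassification.from_pretrained(
    "flax-community/roberta-base-mr",
    num_labels=3,
)
```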
Training data 🏋🏻‍♂️
The RoBERTa Marathi model was pretrained on the mr split of the multilingual C4 dataset. C4 (Colossal Clean Crawled Corpus) was introduced by Raffel et al. in [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://paperswithcode.com/paper/exploring-the-limits-of-transfer-learning). The dataset can be downloaded in a pre-processed form from AllenNLP or from Hugging Face's mc4 dataset. The Marathi (mr) split consists of 14 billion tokens across 7.8 million documents, roughly 70 GB of text.
Data Cleaning 🧹
Although the initial mC4 Marathi corpus is ~70 GB, data exploration showed that it contains documents in other languages, especially Thai, Chinese, etc. We therefore cleaned the dataset before training the tokenizer and the model.
Train set: 1,581,396 of 7,774,331 documents are clean, i.e. ~20.34% of the whole Marathi train split is actually Marathi.
Validation set: 1,700 of 7,928 documents are clean, i.e. ~19.90% of the whole Marathi validation split is actually Marathi.
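The exact cleaning procedure is not spelled out here, but a simple script-ratio heuristic along the following lines illustrates the idea (the `looks_marathi` helper and its 0.5 threshold are hypothetical, not the values used for this model):

```python
import re

DEVANAGARI = re.compile(r"[\u0900-\u097F]")

def looks_marathi(text: str, threshold: float = 0.5) -> bool:
    """Keep a document only if at least `threshold` of its non-space
    characters are Devanagari; this filters out Thai, Chinese, etc."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    devanagari = sum(1 for c in chars if DEVANAGARI.match(c))
    return devanagari / len(chars) >= threshold
```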
Training procedure 👨🏻‍💻
Preprocessing
The texts are tokenized using a byte-level version of Byte-Pair Encoding (BPE) with a vocabulary size of 50,265. The model takes inputs of 512 contiguous tokens that may span documents. The beginning of a new document is marked with <s> and its end with </s>.
The details of the masking procedure for each sentence are as follows:
- 15% of the tokens are masked.
- In 80% of the cases, the masked tokens are replaced by <mask>.
- In 10% of the cases, the masked tokens are replaced by a random token (different from the one they replace).
- In the remaining 10% of the cases, the masked tokens are left as is.
Contrary to BERT, the masking is done dynamically during pretraining (i.e., it changes at each epoch and is not fixed); see the sketch below.
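A minimal sketch of this dynamic 80/10/10 masking scheme using 🤗 Transformers' `DataCollatorForLanguageModeling` (the actual Flax training script may implement it differently):

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("flax-community/roberta-base-mr")

# Selects 15% of tokens per sequence; of those, 80% become <mask>,
# 10% become a random token, and 10% are left unchanged. The mask is
# re-sampled every time a batch is built, so masking is dynamic.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
)
```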
Pretraining
The model was trained on a Google Cloud Engine TPU v3-8 machine (335 GB of RAM, 1000 GB of disk, 96 CPU cores), i.e. 8 TPU v3 cores, for 42K steps with a batch size of 128 and a sequence length of 128. The optimizer used is Adam with a learning rate of 3e-4, β1 = 0.9, β2 = 0.98, ε = 1e-8, a weight decay of 0.01, learning-rate warm-up for 1,000 steps, and linear decay of the learning rate afterwards.
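In optax terms, an equivalent optimizer configuration would look roughly like the sketch below (an illustration of the stated hyperparameters, not the training script itself):

```python
import optax

warmup_steps, total_steps = 1_000, 42_000

# Linear warm-up to 3e-4 over the first 1,000 steps, then linear decay to 0.
schedule = optax.join_schedules(
    schedules=[
        optax.linear_schedule(init_value=0.0, end_value=3e-4,
                              transition_steps=warmup_steps),
        optax.linear_schedule(init_value=3e-4, end_value=0.0,
                              transition_steps=total_steps - warmup_steps),
    ],
    boundaries=[warmup_steps],
)

# Adam with decoupled weight decay, matching the reported β1, β2, ε.
optimizer = optax.adamw(
    learning_rate=schedule,
    b1=0.9,
    b2=0.98,
    eps=1e-8,
    weight_decay=0.01,
)
```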
We tracked experiments and hyperparameter tuning on the Weights & Biases platform. Here is the link to the main dashboard: [Weights & Biases dashboard for the Marathi RoBERTa model](https://wandb.ai/nipunsadvilkar/roberta-base-mr/runs/19qtskbg?workspace=user-nipunsadvilkar)
Pretraining Results 📊
The model reached an eval accuracy of 85.28% at around 35K steps, with a train loss of 0.6507 and an eval loss of 0.6219.
Fine-tuning on downstream tasks
We fine-tuned the model on downstream classification tasks using the following datasets:
1. [IndicNLP Marathi news classification](https://github.com/ai4bharat-indicnlp/indicnlp_corpus#publicly-available-classification-datasets)
The IndicNLP Marathi news dataset consists of 3 classes - ['lifestyle', 'entertainment', 'sports'] - with the following train/eval/test document distribution:
| train | eval | test |
|---|---|---|
| 9672 | 477 | 478 |
💯 Our Marathi RoBERTa roberta-base-mr model outperformed both classifiers mentioned in [Arora, G. (2020). iNLTK](https://www.semanticscholar.org/paper/iNLTK%3A-Natural-Language-Toolkit-for-Indic-Languages-Arora/5039ed9e100d3a1cbbc25a02c82f6ee181609e83/figure/3) and [Kunchukuttan, Anoop et al. AI4Bharat-IndicNLP](https://www.semanticscholar.org/paper/AI4Bharat-IndicNLP-Corpus%3A-Monolingual-Corpora-and-Kunchukuttan-Kakwani/7997d432925aff0ba05497d2893c09918298ca55/figure/4).
| Dataset | FT-W | FT-WC | INLP | iNLTK | roberta-base-mr 🏆 |
|---|---|---|---|---|---|
| iNLTK Headlines | 83.06 | 81.65 | 89.92 | 92.4 | 97.48 |
🤗 Hugging Face Model Hub repo:
The roberta-base-mr model fine-tuned on this news classification dataset: [flax-community/mr-indicnlp-classifier](https://huggingface.co/flax-community/mr-indicnlp-classifier)
🧪 The fine-tuning experiment's Weights & Biases dashboard is [here](https://wandb.ai/nipunsadvilkar/huggingface/runs/1242bike?workspace=user-nipunsadvilkar).
2. [iNLTK Marathi news headline classification](https://www.kaggle.com/disisbig/marathi-news-dataset)
This dataset consists of 3 classes - ['state', 'entertainment', 'sports'] - with the following train/eval/test document distribution:
| train | eval | test |
|---|---|---|
| 9658 | 1210 | 1210 |
💯 Here as well, roberta-base-mr outperformed the iNLTK Marathi news text classifier.
| Dataset | iNLTK ULMFiT | roberta-base-mr 🏆 |
|---|---|---|
| iNLTK news dataset (kaggle) | 92.4 | 94.21 |
🤗 Hugging Face Model Hub repo:
The roberta-base-mr model fine-tuned on the iNLTK news classification dataset: [flax-community/mr-inltk-classifier](https://huggingface.co/flax-community/mr-inltk-classifier)
The fine-tuning experiment's Weights & Biases dashboard is [here](https://wandb.ai/nipunsadvilkar/huggingface/runs/2u5l9hon?workspace=user-nipunsadvilkar). A usage sketch for both fine-tuned classifiers follows below.
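Both fine-tuned checkpoints can presumably be queried through the standard text-classification pipeline (a minimal sketch; the label names depend on each repo's configuration):

```python
from transformers import pipeline

headline_clf = pipeline("text-classification",
                        model="flax-community/mr-indicnlp-classifier")
news_clf = pipeline("text-classification",
                    model="flax-community/mr-inltk-classifier")

# Reusing the sentence from the quick start as an example input.
text = "मोठी बातमी! उद्या दुपारी आठ वाजता जाहीर होणार दहावीचा निकाल"
print(headline_clf(text))
print(news_clf(text))
```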
Want to check how the above models generalise on real-world Marathi data?
Head to 🤗 Hugging Face Spaces 🪐 to play with all three models:
- Masked language modelling with the pretrained Marathi RoBERTa model: [flax-community/roberta-base-mr](https://huggingface.co/flax-community/roberta-base-mr)
- Marathi headline classifier: [flax-community/mr-indicnlp-classifier](https://huggingface.co/flax-community/mr-indicnlp-classifier)
- Marathi news classifier: [flax-community/mr-inltk-classifier](https://huggingface.co/flax-community/mr-inltk-classifier)

[Streamlit app of the pretrained Marathi RoBERTa model on Hugging Face Spaces](https://huggingface.co/spaces/flax-community/roberta-base-mr)

👥 Team Members
- Nipun Sadvilkar @nipunsadvilkar
- Haswanth Aekula @hassiahk
🙏 Credits
Huge thanks to Hugging Face 🤗 and the Google JAX/Flax team for such a wonderful community week, especially for providing such massive computing resources. Big thanks to [@patil-suraj](https://github.com/patil-suraj) and @patrickvonplaten for mentoring during the whole week.


