# RoBERTa-hindi-guj-san
This multilingual RoBERTa-like model is trained on Wikipedia articles in Hindi, Sanskrit, and Gujarati. It aims to handle these related languages effectively, leveraging pre-training on Hindi to learn similar linguistic patterns in Sanskrit and Gujarati.
## Quick Start

### How to use
```python
from transformers import AutoTokenizer, AutoModelWithLMHead, pipeline

tokenizer = AutoTokenizer.from_pretrained("surajp/RoBERTa-hindi-guj-san")
model = AutoModelWithLMHead.from_pretrained("surajp/RoBERTa-hindi-guj-san")

fill_mask = pipeline(
    "fill-mask",
    model=model,
    tokenizer=tokenizer
)

fill_mask("ગુજરાતમાં ૧૯મી માર્ચ સુધી કોઈ સકારાત્મક (પોઝીટીવ) રીપોર્ટ આવ્યો <mask> હતો.")

'''
Output:
--------
[
  {'score': 0.07849744707345963, 'sequence': '<s> ગુજરાતમાં ૧૯મી માર્ચ સુધી કોઈ સકારાત્મક (પોઝીટીવ) રીપોર્ટ આવ્યો જ હતો.</s>', 'token': 390},
  {'score': 0.06273336708545685, 'sequence': '<s> ગુજરાતમાં ૧૯મી માર્ચ સુધી કોઈ સકારાત્મક (પોઝીટીવ) રીપોર્ટ આવ્યો ન હતો.</s>', 'token': 478},
  {'score': 0.05160355195403099, 'sequence': '<s> ગુજરાતમાં ૧૯મી માર્ચ સુધી કોઈ સકારાત્મક (પોઝીટીવ) રીપોર્ટ આવ્યો થઇ હતો.</s>', 'token': 2075},
  {'score': 0.04751499369740486, 'sequence': '<s> ગુજરાતમાં ૧૯મી માર્ચ સુધી કોઈ સકારાત્મક (પોઝીટીવ) રીપોર્ટ આવ્યો એક હતો.</s>', 'token': 600},
  {'score': 0.03788900747895241, 'sequence': '<s> ગુજરાતમાં ૧૯મી માર્ચ સુધી કોઈ સકારાત્મક (પોઝીટીવ) રીપોર્ટ આવ્યો પણ હતો.</s>', 'token': 840}
]
'''
```
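Note that `AutoModelWithLMHead` (used above, as in the original card) is deprecated in newer `transformers` releases. Assuming a recent version is installed, the masked-LM auto class should be a drop-in replacement for this checkpoint:

```python
from transformers import AutoModelForMaskedLM

# Non-deprecated equivalent of AutoModelWithLMHead for a masked-LM checkpoint.
model = AutoModelForMaskedLM.from_pretrained("surajp/RoBERTa-hindi-guj-san")
```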
## Features
- Multilingual support for Hindi, Sanskrit, and Gujarati.
- Leverages pre-training on Hindi to fine-tune on Sanskrit and Gujarati.
## Documentation

### Model description
A multilingual RoBERTa-like model trained on Wikipedia articles in Hindi, Sanskrit, and Gujarati. The tokenizer was trained on the combined text. The model was pre-trained on Hindi text and then fine-tuned on the combined Sanskrit and Gujarati text, with the expectation that pre-training on Hindi would help the model learn the related languages.
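The card only states that the tokenizer was trained on the combined text; the exact training code is not included. A minimal sketch of how such a byte-level BPE tokenizer could be trained with the `tokenizers` library (file names and special tokens are assumptions, not taken from the original setup):

```python
from tokenizers import ByteLevelBPETokenizer

# Hypothetical paths to the cleaned Hindi, Sanskrit and Gujarati Wikipedia text.
files = ["hindi.txt", "sanskrit.txt", "gujarati.txt"]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=files,
    vocab_size=30522,  # matches vocab_size in the configuration below
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model(".")  # writes vocab.json and merges.txt to the current directory
```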
### Configuration

| Property | Details |
|----------|---------|
| hidden_size | 768 |
| num_attention_heads | 12 |
| num_hidden_layers | 6 |
| vocab_size | 30522 |
| model_type | roberta |
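For reference, the same architecture can be expressed programmatically; this is a sketch assuming every field not listed in the table keeps its RoBERTa default:

```python
from transformers import RobertaConfig, RobertaForMaskedLM

# Configuration matching the table above; unlisted fields keep RoBERTa defaults.
config = RobertaConfig(
    vocab_size=30522,
    hidden_size=768,
    num_attention_heads=12,
    num_hidden_layers=6,
)
model = RobertaForMaskedLM(config=config)  # randomly initialised model of this size
```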
## Installation
No specific installation steps are provided in the original document.
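In practice, the model is loaded through the Hugging Face Transformers library, so a standard installation (for example `pip install transformers`, plus a backend such as PyTorch) should be sufficient to run the Quick Start example above.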
## License
This project is licensed under the MIT license.
## Technical Details

### Training data
The model is trained on cleaned Wikipedia articles in Hindi, Sanskrit, and Gujarati hosted on Kaggle. The dataset contains both training and evaluation text and is also used in iNLTK.
### Training procedure
- The model is trained on TPU (using `xla_spawn.py`).
- It is trained for language modelling.
- `--block_size` is iteratively increased from 128 to 256 over epochs (a sketch of this schedule follows the command excerpt below).
- The tokenizer is trained on the combined text.
- The model is pre-trained with Hindi and fine-tuned on Sanskrit and Gujarati texts.

Flags used for training (excerpt):
```bash
--model_type distillroberta-base \
--model_name_or_path "/content/SanHiGujBERTa" \
--mlm_probability 0.20 \
--line_by_line \
--save_total_limit 2 \
--per_device_train_batch_size 128 \
--per_device_eval_batch_size 128 \
--num_train_epochs 5 \
--block_size 256 \
--seed 108 \
--overwrite_output_dir \
```
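The card does not include the author's full launch script. Below is a minimal sketch (an assumption, not the original code) of the iterative `--block_size` schedule described above: the same language-modelling run is relaunched with a growing block size, each pass continuing from the previous output. Script locations, the data path, and the intermediate step of 192 are hypothetical; the remaining flags are copied from the excerpt above.

```python
import subprocess

MODEL_DIR = "/content/SanHiGujBERTa"  # checkpoint directory from the command excerpt above

for block_size in (128, 192, 256):
    subprocess.run(
        [
            "python", "xla_spawn.py", "--num_cores", "8",  # TPU launcher from the transformers examples
            "run_language_modeling.py",
            "--model_type", "distillroberta-base",
            "--model_name_or_path", MODEL_DIR,
            "--mlm",
            "--mlm_probability", "0.20",
            "--line_by_line",
            "--save_total_limit", "2",
            "--per_device_train_batch_size", "128",
            "--per_device_eval_batch_size", "128",
            "--num_train_epochs", "5",
            "--block_size", str(block_size),
            "--seed", "108",
            "--overwrite_output_dir",
            "--do_train",
            "--train_data_file", "train.txt",  # hypothetical path to the training text
            "--output_dir", MODEL_DIR,  # write back so the next pass continues from it
        ],
        check=True,
    )
```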
### Eval results
The perplexity of the model is 2.920005983224673.
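The reported perplexity is the exponential of the evaluation cross-entropy loss computed by the language-modelling script; a quick sanity check of that relationship:

```python
import math

# perplexity = exp(eval_loss); the reported value corresponds to a loss of ~1.0716.
eval_loss = math.log(2.920005983224673)
print(eval_loss)            # ~1.0716
print(math.exp(eval_loss))  # 2.920005983224673
```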
Created by Suraj Parmar/@parmarsuraj99 | LinkedIn
Made with ♥ in India