# IndicBERT
A multilingual language model trained on IndicCorp v2 and evaluated on the IndicXTREME benchmark, supporting 23 Indic languages and English.
## Model Information
| Property | Details |
|---|---|
| Supported Languages | as, bn, brx, doi, en, gom, gu, hi, kn, ks, kas, mai, ml, mr, mni, mnb, ne, or, pa, sa, sat, sd, snd, ta, te, ur |
| Language Details | asm_Beng, ben_Beng, brx_Deva, doi_Deva, eng_Latn, gom_Deva, guj_Gujr, hin_Deva, kan_Knda, kas_Arab, kas_Deva, mai_Deva, mal_Mlym, mar_Deva, mni_Beng, mni_Mtei, npi_Deva, ory_Orya, pan_Guru, san_Deva, sat_Olck, snd_Arab, snd_Deva, tam_Taml, tel_Telu, urd_Arab |
| Tags | indicbert2, ai4bharat, multilingual |
| License | mit |
| Metrics | accuracy |
| Pipeline Tag | fill-mask |
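The Pipeline Tag above is fill-mask, so the checkpoints can be queried directly for masked-token predictions through the `transformers` pipeline API. A minimal sketch, assuming the MLM variant is published on the Hugging Face Hub as `ai4bharat/IndicBERTv2-MLM-only` (the Hub ID is an assumption; substitute whichever checkpoint you actually use):

```python
from transformers import pipeline

# The Hub ID below is an assumption; replace it with the checkpoint you use.
fill_mask = pipeline("fill-mask", model="ai4bharat/IndicBERTv2-MLM-only")

# Build a masked Hindi sentence using the tokenizer's own mask token.
masked = f"मुझे आज {fill_mask.tokenizer.mask_token} जाना है।"

for prediction in fill_mask(masked):
    print(prediction["token_str"], round(prediction["score"], 3))
```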
## Features
- Multilingual Support: The model has 278M parameters and covers 23 Indic languages and English.
- Diverse Training Objectives and Datasets: The model variants are trained with different objectives and datasets, including IndicCorp v2 and the Samanantar parallel corpus.
## Model Variants
- IndicBERT-MLM [Model]: A vanilla BERT-style model trained on IndicCorp v2 with the MLM objective.
  - +Samanantar [Model]: Adds TLM as an additional objective, using the Samanantar Parallel Corpus [Paper] | [Dataset].
  - +Back-Translation [Model]: Adds TLM as an additional objective, using English translations of the Indic portions of the IndicCorp v2 dataset produced with the IndicTrans model [Model].
- IndicBERT-SS [Model]: To encourage better lexical sharing among languages, we convert the scripts of the Indic languages to Devanagari and train a BERT-style model with the MLM objective.
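All of the variants above are BERT-style encoders, so any of them can also be used to extract contextual sentence representations for downstream work. A minimal sketch using mean pooling over the last hidden states, again assuming the Hub ID `ai4bharat/IndicBERTv2-MLM-only` (an assumption, not a confirmed release name):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# The Hub ID below is an assumption; replace it with the released checkpoint name.
MODEL_ID = "ai4bharat/IndicBERTv2-MLM-only"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()

# Two sentences in different Indic languages (Hindi and Bengali).
sentences = ["मैं आज बाज़ार जा रहा हूँ।", "আমি আজ বাজারে যাচ্ছি।"]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**batch)

# Mean-pool the last hidden states over non-padding tokens to get sentence vectors.
mask = batch["attention_mask"].unsqueeze(-1).float()
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # (2, hidden_size)
```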
## Quick Start
### Environment Setup
Fine-tuning scripts are based on the transformers library. Create a new conda environment and set it up as follows:
```bash
conda create -n finetuning python=3.9
conda activate finetuning
pip install -r requirements.txt
```
### Fine-tuning Command
All the tasks follow the same structure. Please check individual files for detailed hyper-parameter choices. The following command runs the fine-tuning for a task:
```bash
python IndicBERT/fine-tuning/$TASK_NAME/$TASK_NAME.py \
    --model_name_or_path=$MODEL_NAME \
    --do_train
```
Arguments:
- MODEL_NAME: Name of the model to fine-tune; can be a local path or a model from the Hugging Face Model Hub.
- TASK_NAME: One of ner, paraphrase, qa, sentiment, xcopa, xnli, flores.
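As a concrete example, a fine-tuning run for the NER task with the MLM checkpoint might look as follows (the Hub ID is an assumption; a local checkpoint path works equally well):

```bash
export MODEL_NAME=ai4bharat/IndicBERTv2-MLM-only  # assumed Hub ID; any local path also works
export TASK_NAME=ner

python IndicBERT/fine-tuning/$TASK_NAME/$TASK_NAME.py \
    --model_name_or_path=$MODEL_NAME \
    --do_train
```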
## Important Note
For the MASSIVE task, please use the instructions provided in the official repository.
## Citation
```bibtex
@inproceedings{doddapaneni-etal-2023-towards,
title = "Towards Leaving No {I}ndic Language Behind: Building Monolingual Corpora, Benchmark and Models for {I}ndic Languages",
author = "Doddapaneni, Sumanth and
Aralikatte, Rahul and
Ramesh, Gowtham and
Goyal, Shreya and
Khapra, Mitesh M. and
Kunchukuttan, Anoop and
Kumar, Pratyush",
editor = "Rogers, Anna and
Boyd-Graber, Jordan and
Okazaki, Naoaki",
booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2023",
address = "Toronto, Canada",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.acl-long.693",
doi = "10.18653/v1/2023.acl-long.693",
pages = "12402--12426",
abstract = "Building Natural Language Understanding (NLU) capabilities for Indic languages, which have a collective speaker base of more than one billion speakers is absolutely crucial. In this work, we aim to improve the NLU capabilities of Indic languages by making contributions along 3 important axes (i) monolingual corpora (ii) NLU testsets (iii) multilingual LLMs focusing on Indic languages. Specifically, we curate the largest monolingual corpora, IndicCorp, with 20.9B tokens covering 24 languages from 4 language families - a 2.3x increase over prior work, while supporting 12 additional languages. Next, we create a human-supervised benchmark, IndicXTREME, consisting of nine diverse NLU tasks covering 20 languages. Across languages and tasks, IndicXTREME contains a total of 105 evaluation sets, of which 52 are new contributions to the literature. To the best of our knowledge, this is the first effort towards creating a standard benchmark for Indic languages that aims to test the multilingual zero-shot capabilities of pretrained language models. Finally, we train IndicBERT v2, a state-of-the-art model supporting all the languages. Averaged across languages and tasks, the model achieves an absolute improvement of 2 points over a strong baseline. The data and models are available at \url{https://github.com/AI4Bharat/IndicBERT}.",
}
```