muril-large-cased Open-Source Multilingual Model - Supports 17 Indian Languages and Transliterated Text Processing

MuRIL Large Cased

Developed by Google

A multilingual pre-trained model for Indian languages based on the BERT large architecture, covering 17 Indian languages and their transliterated counterparts

Large Language Model

Transformers

#Indian Multilingual Processing #Transliterated Text Optimization #Low-Resource Language Enhancement

Downloads 6,307

Release Time : 3/2/2022

Model Overview

MuRIL is a multilingual representation model optimized for Indian languages. It improves performance on low-resource languages by training on translated and transliterated data alongside monolingual text, and is suited to NLP tasks in Indian languages.

Model Features

Multilingual Transliteration Optimization

Trains on original text paired with its transliteration, addressing the transliterated writing that is common in the Indian context

Low-Resource Language Enhancement

Upsamples low-resource languages using an exponent of 0.3 (rather than 0.7), a setting shown to improve low-resource performance

Parallel Data Training

Jointly trains on translated data (produced with the Google NMT pipeline, plus PMINDIA) and transliterated data (produced with IndicTrans, plus Dakshina)

Model Capabilities

Multilingual Text Understanding

Transliterated Text Processing

Named Entity Recognition

Text Classification

Question Answering

Use Cases

Government Services

Multilingual Policy Document Analysis

Processes government documents in different Indian languages

Achieves a zero-shot F1 score of 77.7% on the PANX named entity recognition task across bn, en, hi, ml, mr, ta, te, and ur

Education

Cross-Language Educational Resource Processing

Automatically processes educational materials in different Indian languages

Improves F1 by 3.0 points over XLM-R (large) on the TyDiQA task (71.5 vs. 68.5)

🚀 MuRIL Large

A BERT Large (24L) model pre-trained on 17 Indian languages and their transliterated counterparts, offering multilingual representations for Indian languages.

🚀 Quick Start

The sections below cover the MuRIL Large model's features, architecture, training data, intended uses and limitations, and evaluation results.

✨ Features

Multilingual Representation: Pre-trained on 17 Indian languages and their transliterated counterparts, enabling effective processing of multilingual data.
Modified Training Paradigm: Incorporates translation and transliteration segment pairs in training and uses an exponent value of 0.3 for upsampling to enhance low-resource performance.
Self-supervised Learning: Trained using a self-supervised masked language modeling task with whole word masking.

📦 Installation

No installation steps were provided in the original README.

💻 Usage Examples

No code examples were provided in the original README.
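As a minimal sketch of how the model might be used, the snippet below assumes the checkpoint is published on the Hugging Face Hub as `google/muril-large-cased` and that the `transformers` and `torch` packages are installed. The helper name `embed_sentences` and the mean-pooling step are illustrative choices, not something the README prescribes.

```python
MODEL_ID = "google/muril-large-cased"  # assumed Hub ID; not stated in the README


def embed_sentences(sentences, model_id=MODEL_ID):
    """Return one mean-pooled vector per sentence (illustrative helper)."""
    # Imported lazily so the heavy dependencies load only when the helper runs.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()

    # 512 is the maximum sequence length used in pre-training.
    batch = tokenizer(sentences, padding=True, truncation=True,
                      max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, seq_len, hidden)

    # Average the token vectors, ignoring padding positions.
    mask = batch["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)
```

Calling `embed_sentences(["यह एक उदाहरण है।", "yeh ek udaharan hai."])` would download the weights on first use and return a tensor with one row per sentence, here a native-script Hindi sentence and a transliterated variant. For downstream tasks, fine-tuning the full model is the practice recommended under Trainable parameters.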

📚 Documentation

Overview

This model uses a BERT large architecture [1] pretrained from scratch using the Wikipedia [2], Common Crawl [3], PMINDIA [4], and Dakshina [5] corpora for 17 [6] Indian languages.

We use a training paradigm similar to multilingual BERT, with a few modifications as listed:

- We include translation and transliteration segment pairs in training.
- We use an exponent of 0.3 rather than 0.7 for upsampling, which has been shown to enhance low-resource performance [7].

See the Training section for more details.

Training

The MuRIL model is pre-trained on monolingual segments as well as parallel segments as detailed below:

- Monolingual Data: We make use of publicly available corpora from Wikipedia and Common Crawl for 17 Indian languages.
- Parallel Data: We have two types of parallel data:
  - Translated Data: We obtain translations of the above monolingual corpora using the Google NMT pipeline. We feed translated segment pairs as input. We also make use of the publicly available PMINDIA corpus.
  - Transliterated Data: We obtain transliterations of Wikipedia using the IndicTrans [8] library. We feed transliterated segment pairs as input. We also make use of the publicly available Dakshina dataset.

We use an exponent of 0.3 to calculate the duplication multipliers (dupe factors) that upsample lower-resourced languages. Note that transliterated pairs are limited to Wikipedia only.
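To make the role of the exponent concrete, here is a minimal sketch of exponent-smoothed sampling in the style of [7]: each language's share of the corpus is raised to the power 0.3 and renormalised, which shifts weight toward smaller languages. The segment counts are hypothetical, and the rule used to turn smoothed shares into duplication multipliers is an assumption, since the exact formula is not given here.

```python
def smoothed_shares(counts, exponent=0.3):
    """Raise each language's corpus share to `exponent` and renormalise."""
    total = sum(counts.values())
    weights = {lang: (n / total) ** exponent for lang, n in counts.items()}
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}


def dupe_factors(counts, exponent=0.3):
    """Assumed rule: duplicate each language so its share of the upsampled
    corpus tracks its smoothed share, keeping the largest language at 1."""
    shares = smoothed_shares(counts, exponent)
    ratios = {lang: shares[lang] / counts[lang] for lang in counts}
    base = min(ratios.values())  # the largest language has the smallest ratio
    return {lang: max(1, round(r / base)) for lang, r in ratios.items()}


# Hypothetical segment counts, for illustration only.
counts = {"hi": 1_000_000, "ta": 100_000, "ks": 1_000}
for exp in (0.7, 0.3):
    shares = smoothed_shares(counts, exp)
    print(exp, {lang: round(s, 3) for lang, s in shares.items()})
print(dupe_factors(counts))
```

With the lower exponent, the smallest language receives a larger share of the sampled data than it would at 0.7, which is the effect the upsampling is meant to achieve.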

The model was trained using a self-supervised masked language modeling task. We use whole word masking with a maximum of 80 predictions per sequence. The model was trained for 1500K steps, with a batch size of 8192 and a maximum sequence length of 512.
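Whole word masking selects entire words, so all WordPiece pieces of a chosen word are masked together. The following is a simplified sketch, assuming BERT-style `##` continuation markers and a 15% masking rate (BERT's default, not stated here); only the cap of 80 predictions comes from the text. It always substitutes `[MASK]` and omits BERT's random-token and keep-original replacements.

```python
import random


def whole_word_mask(tokens, max_predictions=80, mask_rate=0.15, seed=0):
    """Mask whole words in a WordPiece sequence (simplified sketch)."""
    rng = random.Random(seed)

    # Group token indices into words: a '##' piece continues the previous word.
    words = []
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]"):
            continue
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])

    maskable = sum(len(w) for w in words)
    budget = min(max_predictions, max(1, round(maskable * mask_rate)))

    rng.shuffle(words)
    masked = list(tokens)
    targets = {}  # position -> original token to predict
    for word in words:
        if len(targets) >= budget:
            break
        if len(targets) + len(word) > budget:
            continue  # skip words that would exceed the prediction budget
        for i in word:
            targets[i] = tokens[i]
            masked[i] = "[MASK]"
    return masked, targets


tokens = ["[CLS]", "trans", "##liter", "##ation", "is", "common", "[SEP]"]
masked, targets = whole_word_mask(tokens, mask_rate=0.6)
print(masked)
```

In this example the three pieces of "transliteration" are either all masked or all left intact, never split.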

Trainable parameters

All parameters in the module are trainable, and fine-tuning all parameters is the recommended practice.

Uses & Limitations

This model is intended to be used for a variety of downstream NLP tasks for Indian languages. It is also trained on transliterated data, since transliteration is commonly observed in the Indian context. It is not expected to perform well on languages other than the 17 used in pre-training.

Evaluation

We report the results of fine-tuning this model on a set of downstream tasks chosen from the XTREME benchmark, with evaluation done on Indian-language test sets. All results are computed in a zero-shot setting: the model is fine-tuned only on English, the high-resource training language, and then evaluated directly on the listed test languages. The results for XLM-R (Large) are taken from the XTREME paper [9].

Task (metric)	Evaluation languages	XLM-R (Large)	MuRIL (Large)
PANX (F1)	bn, en, hi, ml, mr, ta, te, ur	68.0%	77.7%
UDPOS (F1)	en, hi, mr, ta, te, ur	79.2%	77.3%
XNLI (Accuracy)	en, hi, ur	78.7%	78.6%
XQuAD (F1/EM)	en, hi	81.6/67.7	83.3/70.1
MLQA (F1/EM)	en, hi	77.1/61.9	78.3/62.9
TyDiQA (F1/EM)	en, bn, te	68.5/49.4	71.5/56.6
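For a quick read of the table, the per-task differences can be computed directly. The sketch below uses only the values above (F1 for the question answering tasks) and reports MuRIL minus XLM-R in percentage points.

```python
# (XLM-R large, MuRIL large) scores copied from the table; F1 for QA tasks.
scores = {
    "PANX": (68.0, 77.7),
    "UDPOS": (79.2, 77.3),
    "XNLI": (78.7, 78.6),
    "XQuAD": (81.6, 83.3),
    "MLQA": (77.1, 78.3),
    "TyDiQA": (68.5, 71.5),
}

# Positive values mean MuRIL scores higher than XLM-R.
deltas = {task: round(muril - xlmr, 1) for task, (xlmr, muril) in scores.items()}
for task, delta in deltas.items():
    print(f"{task:7s} {delta:+.1f}")
```

MuRIL is ahead on four of the six tasks, with the largest gain on PANX (+9.7), and slightly behind on UDPOS (-1.9) and XNLI (-0.1).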

The fine-tuning hyperparameters are as follows:

Task	Batch Size	Learning Rate	Epochs	Warm-up Ratio
PANX	32	2e-5	10	0.1
UDPOS	64	5e-6	10	0.1
XNLI	128	2e-5	5	0.1
XQuAD	32	3e-5	2	0.1
MLQA	32	3e-5	2	0.1
TyDiQA	32	3e-5	3	0.1
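The table can be captured as a small configuration, and the warm-up ratio converted into a step count. This is a sketch under stated assumptions: linear warm-up followed by linear decay is a common BERT fine-tuning schedule assumed here rather than stated in the README, and the training-set size in the example is hypothetical.

```python
import math

# Hyperparameters copied from the table above.
FINETUNE_CONFIG = {
    "PANX":   {"batch_size": 32,  "lr": 2e-5, "epochs": 10, "warmup_ratio": 0.1},
    "UDPOS":  {"batch_size": 64,  "lr": 5e-6, "epochs": 10, "warmup_ratio": 0.1},
    "XNLI":   {"batch_size": 128, "lr": 2e-5, "epochs": 5,  "warmup_ratio": 0.1},
    "XQuAD":  {"batch_size": 32,  "lr": 3e-5, "epochs": 2,  "warmup_ratio": 0.1},
    "MLQA":   {"batch_size": 32,  "lr": 3e-5, "epochs": 2,  "warmup_ratio": 0.1},
    "TyDiQA": {"batch_size": 32,  "lr": 3e-5, "epochs": 3,  "warmup_ratio": 0.1},
}


def schedule(task, num_examples):
    """Return (total_steps, warmup_steps, lr_at) for a task."""
    cfg = FINETUNE_CONFIG[task]
    steps_per_epoch = math.ceil(num_examples / cfg["batch_size"])
    total = steps_per_epoch * cfg["epochs"]
    warmup = round(total * cfg["warmup_ratio"])

    def lr_at(step):
        # Assumed schedule: linear warm-up, then linear decay to zero.
        if step < warmup:
            return cfg["lr"] * step / max(1, warmup)
        return cfg["lr"] * max(0.0, (total - step) / max(1, total - warmup))

    return total, warmup, lr_at


# Hypothetical training-set size, for illustration only.
total, warmup, lr_at = schedule("XNLI", num_examples=128_000)
print(total, warmup, lr_at(warmup))
```

With a warm-up ratio of 0.1, the first tenth of the optimizer steps ramp the learning rate up to its peak value from the table.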

References

[1] Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805, 2018.
[2] Wikipedia
[3] Common Crawl
[4] PMINDIA
[5] Dakshina
[6] Assamese (as), Bengali (bn), English (en), Gujarati (gu), Hindi (hi), Kannada (kn), Kashmiri (ks), Malayalam (ml), Marathi (mr), Nepali (ne), Oriya (or), Punjabi (pa), Sanskrit (sa), Sindhi (sd), Tamil (ta), Telugu (te), and Urdu (ur).
[7] Conneau, Alexis, et al. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116 (2019).
[8] IndicTrans
[9] Hu, J., Ruder, S., Siddhant, A., Neubig, G., Firat, O., & Johnson, M. (2020). Xtreme: A massively multilingual multi-task benchmark for evaluating cross-lingual generalization. arXiv preprint arXiv:2003.11080.
[10] Fang, Y., Wang, S., Gan, Z., Sun, S., & Liu, J. (2020). FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding. arXiv preprint arXiv:2009.05166.

Citation

If you find MuRIL useful in your applications, please cite the following paper:

@misc{khanuja2021muril,
      title={MuRIL: Multilingual Representations for Indian Languages},
      author={Simran Khanuja and Diksha Bansal and Sarvesh Mehtani and Savya Khosla and Atreyee Dey and Balaji Gopalan and Dilip Kumar Margam and Pooja Aggarwal and Rajiv Teja Nagipogu and Shachi Dave and Shruti Gupta and Subhash Chandra Bose Gali and Vish Subramanian and Partha Talukdar},
      year={2021},
      eprint={2103.10730},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contact

Please mail your queries/feedback to muril-contact@google.com.
