MuRIL: Multilingual Representations for Indian Languages
MuRIL is a BERT model pre-trained on 17 Indian languages and their transliterated counterparts. It supports masked word prediction and is available as an encoder on TFHub, providing multilingual representations for Indian languages.
Quick Start
In this repository, we've released the pre-trained MuRIL model with the MLM layer intact, which enables masked word prediction. We've also made the encoder available on TFHub, along with an additional pre-processing module that transforms raw text into the expected input format for the encoder. You can find more details about MuRIL in the paper cited below.
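As a quick sanity check, the sketch below loads the encoder together with the pre-processing module via `tensorflow_hub` and embeds two sentences. It assumes the standard TF2 BERT encoder/preprocessor interface; the TFHub handle URLs shown are indicative and should be verified on tfhub.dev.

```python
# Minimal sketch: embed sentences with the MuRIL encoder from TFHub.
# The handle URLs below are assumptions; check tfhub.dev for the published modules.
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401  (registers ops used by the preprocessing module)

preprocessor = hub.KerasLayer("https://tfhub.dev/google/MuRIL_preprocess/1")
encoder = hub.KerasLayer("https://tfhub.dev/google/MuRIL/1", trainable=False)

sentences = tf.constant([
    "MuRIL covers 17 Indian languages.",
    "मुरील हिंदी और उसका लिप्यंतरण दोनों समझता है।",
])
encoder_inputs = preprocessor(sentences)   # input_word_ids, input_mask, input_type_ids
outputs = encoder(encoder_inputs)

print(outputs["pooled_output"].shape)      # (2, 768) sentence-level embeddings
print(outputs["sequence_output"].shape)    # (2, 128, 768) token-level embeddings
```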
Features
- Multilingual Pretraining: The model uses a BERT base architecture pretrained from scratch using Wikipedia, Common Crawl, PMINDIA, and Dakshina corpora for 17 Indian languages.
- Modified Training Paradigm: Similar to multilingual BERT but with modifications. It includes translation and transliteration segment pairs in training and uses an exponent value of 0.3 for upsampling to enhance low-resource performance.
Documentation
Overview
This model employs a BERT base architecture [1] that is pretrained from scratch. The training data sources are the Wikipedia [2], Common Crawl [3], PMINDIA [4], and Dakshina [5] corpora for 17 [6] Indian languages.
The training paradigm is similar to multilingual BERT, with the following modifications:
- Inclusion of Segment Pairs: We include translation and transliteration segment pairs in training.
- Upsampling Exponent: We use an exponent value of 0.3 (instead of 0.7) for upsampling, which has been shown to enhance low-resource performance [7].
Training
The MuRIL model is pre-trained on both monolingual and parallel segments:
- Monolingual Data: We utilize publicly available corpora from Wikipedia and Common Crawl for 17 Indian languages.
- Parallel Data:
- Translated Data: We obtain translations of the above monolingual corpora using the Google NMT pipeline and also use the publicly available PMINDIA corpus.
- Transliterated Data: We obtain transliterations of Wikipedia using the IndicTrans [8] library and also use the publicly available Dakshina dataset.
We use an exponent value of 0.3 to calculate duplication multiplier values for upsampling lower-resourced languages and set dupe factors accordingly. Note that we limit transliterated pairs to Wikipedia only.
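As an illustration of how such exponent-smoothed upsampling can work, the sketch below turns made-up per-language corpus sizes into sampling probabilities proportional to q_i^0.3 (following [7]) and derives a relative duplication multiplier per language; the exact dupe-factor computation used for MuRIL may differ.

```python
# Illustrative sketch of exponent-smoothed upsampling (alpha = 0.3).
# Corpus sizes are invented for the example; the real pipeline's exact
# dupe-factor computation may differ.
ALPHA = 0.3

corpus_sizes = {"en": 1_000_000, "hi": 200_000, "ta": 50_000, "as": 5_000}
total = sum(corpus_sizes.values())

# Raw data fractions q_i and smoothed sampling probabilities p_i proportional to q_i**alpha.
q = {lang: n / total for lang, n in corpus_sizes.items()}
norm = sum(f ** ALPHA for f in q.values())
p = {lang: (f ** ALPHA) / norm for lang, f in q.items()}

# A duplication multiplier proportional to how much each language is
# oversampled relative to its natural share of the data.
dupe_factor = {lang: round(p[lang] / q[lang], 2) for lang in corpus_sizes}
print(dupe_factor)  # lower-resourced languages get the largest multipliers
```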
The model was trained using a self-supervised masked language modeling task with whole word masking (maximum of 80 predictions). It was trained for 1000K steps, with a batch size of 4096 and a max sequence length of 512.
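The sketch below illustrates the whole-word-masking idea on WordPiece tokens: all sub-word pieces of a sampled word are masked together, with the number of masked tokens capped (80 predictions per sequence in pre-training). It is a simplified illustration, not the actual training code.

```python
import random

# Simplified illustration of whole word masking over WordPiece tokens.
# Pieces starting with "##" belong to the preceding word, so a word is
# either fully masked or left untouched. Capped at max_predictions.
def whole_word_mask(tokens, mask_prob=0.15, max_predictions=80, seed=0):
    rng = random.Random(seed)
    # Group token indices into whole words.
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])
    rng.shuffle(words)
    budget = min(max_predictions, max(1, int(len(tokens) * mask_prob)))
    masked = list(tokens)
    covered = 0
    for word in words:
        if covered + len(word) > budget:
            continue
        for i in word:
            masked[i] = "[MASK]"
        covered += len(word)
    return masked

print(whole_word_mask(["mu", "##ril", "under", "##stands", "hindi", "and", "tamil"]))
```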
Trainable parameters
All parameters in the module are trainable, and fine-tuning all parameters is the recommended practice.
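A minimal sketch of that practice, assuming the TFHub handles used above: setting `trainable=True` on the encoder updates every MuRIL parameter together with a small task head. The head size and hyperparameters here are illustrative.

```python
# Sketch: a classification head on top of MuRIL with all encoder parameters trainable.
# Handle URLs and hyperparameters are illustrative, not prescribed by the model card.
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401

preprocessor = hub.KerasLayer("https://tfhub.dev/google/MuRIL_preprocess/1")
encoder = hub.KerasLayer("https://tfhub.dev/google/MuRIL/1", trainable=True)  # fine-tune everything

text_in = tf.keras.Input(shape=(), dtype=tf.string)
x = encoder(preprocessor(text_in))["pooled_output"]
x = tf.keras.layers.Dropout(0.1)(x)
out = tf.keras.layers.Dense(3, activation="softmax")(x)  # e.g. 3-way NLI labels

model = tf.keras.Model(text_in, out)
model.compile(optimizer=tf.keras.optimizers.Adam(2e-5),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_texts, train_labels, epochs=3, batch_size=32)
```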
Uses & Limitations
This model is designed for a variety of downstream NLP tasks on Indian languages. It is also trained on transliterated data, which is common in the Indian context. However, it may not perform well on languages other than the 17 Indian languages used in pretraining.
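For masked word prediction on these languages, a typical quick check is a fill-mask query. The sketch below uses the Hugging Face `transformers` pipeline with the checkpoint name `google/muril-base-cased`, which is assumed here and should be verified on the model hub.

```python
# Sketch: masked word prediction with the MLM head via Hugging Face transformers.
# The checkpoint name is an assumption; verify it on the model hub.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="google/muril-base-cased")

# Hindi: "The capital of India is [MASK]."
for pred in fill_mask("भारत की राजधानी [MASK] है।", top_k=3):
    print(pred["token_str"], round(pred["score"], 3))
```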
Evaluation
We provide the results of fine-tuning this model on a set of downstream tasks from the XTREME benchmark [9], evaluated on Indian language test sets and their transliterated versions.
Results on XTREME benchmark datasets (in %)
| Task | Metric | Model | en | hi | mr | ta | te | ur | bn | ml | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|
| PANX | F1 | mBERT | 84.40 | 65.13 | 58.44 | 51.24 | 50.16 | 31.36 | 68.59 | 54.77 | 58.01 |
| PANX | F1 | MuRIL | 84.43 | 78.09 | 74.63 | 71.86 | 64.99 | 85.07 | 85.97 | 75.74 | 77.60 |
| UDPOS | F1 | mBERT | 95.35 | 66.09 | 71.27 | 59.58 | 76.98 | 57.85 | - | - | 71.19 |
| UDPOS | F1 | MuRIL | 95.55 | 64.47 | 82.95 | 62.57 | 85.63 | 58.93 | - | - | 75.02 |
| XNLI | Accuracy | mBERT | 81.72 | 60.52 | - | - | - | 58.20 | - | - | 66.81 |
| XNLI | Accuracy | MuRIL | 83.85 | 70.66 | - | - | - | 67.70 | - | - | 74.07 |
| Tatoeba | Accuracy | mBERT | - | 27.80 | 18.00 | 12.38 | 14.96 | 22.70 | 12.80 | 20.23 | 18.41 |
| Tatoeba | Accuracy | MuRIL | - | 31.50 | 26.60 | 36.81 | 17.52 | 17.10 | 20.20 | 26.35 | 25.15 |
| XQUAD | F1/EM | mBERT | 83.85/72.86 | 58.46/43.53 | - | - | - | - | - | - | 71.15/58.19 |
| XQUAD | F1/EM | MuRIL | 84.31/72.94 | 73.93/58.32 | - | - | - | - | - | - | 79.12/65.63 |
| MLQA | F1/EM | mBERT | 80.39/67.30 | 50.28/35.18 | - | - | - | - | - | - | 65.34/51.24 |
| MLQA | F1/EM | MuRIL | 80.28/67.37 | 67.34/50.22 | - | - | - | - | - | - | 73.81/58.80 |
| TyDiQA | F1/EM | mBERT | 75.21/65.00 | 60.62/45.13 | - | - | 53.55/44.54 | - | - | - | 63.13/51.66 |
| TyDiQA | F1/EM | MuRIL | 74.10/64.55 | 78.03/66.37 | - | - | 73.95/46.94 | - | - | - | 75.36/59.28 |
Results on transliterated test sets (in %)
| Task | Metric | Model | ml_tr | ta_tr | te_tr | bn_tr | hi_tr | mr_tr | ur_tr | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| PANX | F1 | mBERT | 7.53 | 1.04 | 8.24 | 41.77 | 25.46 | 8.34 | 7.30 | 14.24 |
| PANX | F1 | MuRIL | 63.39 | 7.00 | 53.62 | 72.94 | 69.75 | 68.77 | 68.41 | 57.70 |
| UDPOS | F1 | mBERT | - | - | - | - | 25.00 | 33.67 | 24.02 | 36.21 |
| UDPOS | F1 | MuRIL | - | - | - | - | 63.09 | 67.19 | 58.40 | 65.30 |
| XNLI | Accuracy | mBERT | - | - | - | - | 39.6 | - | 38.86 | 39.23 |
| XNLI | Accuracy | MuRIL | - | - | - | - | 68.24 | - | 61.16 | 64.70 |
| Tatoeba | Accuracy | mBERT | 2.18 | 1.95 | 5.13 | 1.80 | 3.00 | 2.40 | 2.30 | 2.68 |
| Tatoeba | Accuracy | MuRIL | 10.33 | 11.07 | 11.54 | 8.10 | 14.90 | 7.20 | 13.70 | 10.98 |
References
- [1]: Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805, 2018.
- [2]: Wikipedia
- [3]: [Common Crawl](http://commoncrawl.org/the-data/)
- [4]: [PMINDIA](http://lotus.kuee.kyoto-u.ac.jp/WAT/indic-multilingual/index.html)
- [5]: [Dakshina](https://github.com/google-research-datasets/dakshina)
- [6]: Assamese (as), Bengali (bn), English (en), Gujarati (gu), Hindi (hi), Kannada (kn), Kashmiri (ks), Malayalam (ml), Marathi (mr), Nepali (ne), Oriya (or), Punjabi (pa), Sanskrit (sa), Sindhi (sd), Tamil (ta), Telugu (te) and Urdu (ur).
- [7]: Conneau, Alexis, et al. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116, 2019.
- [8]: [IndicTrans](https://github.com/libindic/indic-trans)
- [9]: Hu, J., Ruder, S., Siddhant, A., Neubig, G., Firat, O., & Johnson, M. (2020). XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalization. arXiv preprint arXiv:2003.11080.
- [10]: Fang, Y., Wang, S., Gan, Z., Sun, S., & Liu, J. (2020). FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding. arXiv preprint arXiv:2009.05166.
Citation
If you find MuRIL useful in your applications, please cite the following paper:
```bibtex
@misc{khanuja2021muril,
      title={MuRIL: Multilingual Representations for Indian Languages},
      author={Simran Khanuja and Diksha Bansal and Sarvesh Mehtani and Savya Khosla and Atreyee Dey and Balaji Gopalan and Dilip Kumar Margam and Pooja Aggarwal and Rajiv Teja Nagipogu and Shachi Dave and Shruti Gupta and Subhash Chandra Bose Gali and Vish Subramanian and Partha Talukdar},
      year={2021},
      eprint={2103.10730},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
Contact
Please mail your queries/feedback to muril-contact@google.com.
License
This project is licensed under the Apache 2.0 license.

