MuRIL: Multilingual Representations for Indian Languages
MuRIL is a BERT model pre-trained on 17 Indian languages and their transliterated counterparts. It supports masked word prediction and is available as an encoder on TFHub, providing multilingual representations for Indian languages.
Quick Start
In this repository, we've released the pre-trained MuRIL model with the MLM layer intact, which enables masked word prediction. We've also made the encoder available on TFHub, along with an additional pre-processing module that transforms raw text into the expected input format for the encoder. You can find more details about MuRIL in the paper cited below.
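As a quick sanity check, the sketch below loads the encoder together with the pre-processing module via `tensorflow_hub` and embeds two sentences. It assumes the standard TF2 BERT encoder/preprocessor interface; the TFHub handle URLs shown are indicative and should be verified on tfhub.dev.

```python
# Minimal sketch: embed sentences with the MuRIL encoder from TFHub.
# The handle URLs below are assumptions; check tfhub.dev for the published modules.
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401  (registers ops used by the preprocessing module)

preprocessor = hub.KerasLayer("https://tfhub.dev/google/MuRIL_preprocess/1")
encoder = hub.KerasLayer("https://tfhub.dev/google/MuRIL/1", trainable=False)

sentences = tf.constant([
    "MuRIL covers 17 Indian languages.",
    "मुरील हिंदी और उसका लिप्यंतरण दोनों समझता है।",
])
encoder_inputs = preprocessor(sentences)   # input_word_ids, input_mask, input_type_ids
outputs = encoder(encoder_inputs)

print(outputs["pooled_output"].shape)      # (2, 768) sentence-level embeddings
print(outputs["sequence_output"].shape)    # (2, 128, 768) token-level embeddings
```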
Features
- Multilingual Pretraining: The model uses a BERT base architecture pretrained from scratch using Wikipedia, Common Crawl, PMINDIA, and Dakshina corpora for 17 Indian languages.
- Modified Training Paradigm: Similar to multilingual BERT but with modifications. It includes translation and transliteration segment pairs in training and uses an exponent value of 0.3 for upsampling to enhance low-resource performance.
Documentation
Overview
This model employs a BERT base architecture [1] that is pretrained from scratch. The training data sources are the Wikipedia [2], Common Crawl [3], PMINDIA [4], and Dakshina [5] corpora for 17 [6] Indian languages.
The training paradigm is similar to multilingual BERT, with the following modifications:
- Inclusion of Segment Pairs: We include translation and transliteration segment pairs in training.
- Upsampling Exponent: We use an exponent value of 0.3 (instead of 0.7) for upsampling, which has been shown to enhance low-resource performance [7].
Training
The MuRIL model is pre-trained on both monolingual and parallel segments:
- Monolingual Data: We utilize publicly available corpora from Wikipedia and Common Crawl for 17 Indian languages.
- Parallel Data:
- Translated Data: We obtain translations of the above monolingual corpora using the Google NMT pipeline and also use the publicly available PMINDIA corpus.
- Transliterated Data: We obtain transliterations of Wikipedia using the IndicTrans [8] library and also use the publicly available Dakshina dataset.
We use an exponent value of 0.3 to calculate duplication multiplier values for upsampling lower-resourced languages and set dupe factors accordingly. Note that we limit transliterated pairs to Wikipedia only.
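As an illustration of how such exponent-smoothed upsampling can work, the sketch below turns made-up per-language corpus sizes into sampling probabilities proportional to q_i^0.3 (following [7]) and derives a relative duplication multiplier per language; the exact dupe-factor computation used for MuRIL may differ.

```python
# Illustrative sketch of exponent-smoothed upsampling (alpha = 0.3).
# Corpus sizes are invented for the example; the real pipeline's exact
# dupe-factor computation may differ.
ALPHA = 0.3

corpus_sizes = {"en": 1_000_000, "hi": 200_000, "ta": 50_000, "as": 5_000}
total = sum(corpus_sizes.values())

# Raw data fractions q_i and smoothed sampling probabilities p_i proportional to q_i**alpha.
q = {lang: n / total for lang, n in corpus_sizes.items()}
norm = sum(f ** ALPHA for f in q.values())
p = {lang: (f ** ALPHA) / norm for lang, f in q.items()}

# A duplication multiplier proportional to how much each language is
# oversampled relative to its natural share of the data.
dupe_factor = {lang: round(p[lang] / q[lang], 2) for lang in corpus_sizes}
print(dupe_factor)  # lower-resourced languages get the largest multipliers
```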
The model was trained using a self-supervised masked language modeling task with whole word masking (maximum of 80 predictions). It was trained for 1000K steps, with a batch size of 4096 and a max sequence length of 512.
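The sketch below illustrates the whole-word-masking idea on WordPiece tokens: all sub-word pieces of a sampled word are masked together, with the number of masked tokens capped (80 predictions per sequence in pre-training). It is a simplified illustration, not the actual training code.

```python
import random

# Simplified illustration of whole word masking over WordPiece tokens.
# Pieces starting with "##" belong to the preceding word, so a word is
# either fully masked or left untouched. Capped at max_predictions.
def whole_word_mask(tokens, mask_prob=0.15, max_predictions=80, seed=0):
    rng = random.Random(seed)
    # Group token indices into whole words.
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])
    rng.shuffle(words)
    budget = min(max_predictions, max(1, int(len(tokens) * mask_prob)))
    masked = list(tokens)
    covered = 0
    for word in words:
        if covered + len(word) > budget:
            continue
        for i in word:
            masked[i] = "[MASK]"
        covered += len(word)
    return masked

print(whole_word_mask(["mu", "##ril", "under", "##stands", "hindi", "and", "tamil"]))
```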
Trainable parameters
All parameters in the module are trainable, and fine-tuning all parameters is the recommended practice.
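A minimal sketch of that practice, assuming the TFHub handles used above: setting `trainable=True` on the encoder updates every MuRIL parameter together with a small task head. The head size and hyperparameters here are illustrative.

```python
# Sketch: a classification head on top of MuRIL with all encoder parameters trainable.
# Handle URLs and hyperparameters are illustrative, not prescribed by the model card.
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401

preprocessor = hub.KerasLayer("https://tfhub.dev/google/MuRIL_preprocess/1")
encoder = hub.KerasLayer("https://tfhub.dev/google/MuRIL/1", trainable=True)  # fine-tune everything

text_in = tf.keras.Input(shape=(), dtype=tf.string)
x = encoder(preprocessor(text_in))["pooled_output"]
x = tf.keras.layers.Dropout(0.1)(x)
out = tf.keras.layers.Dense(3, activation="softmax")(x)  # e.g. 3-way NLI labels

model = tf.keras.Model(text_in, out)
model.compile(optimizer=tf.keras.optimizers.Adam(2e-5),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_texts, train_labels, epochs=3, batch_size=32)
```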
Uses & Limitations
This model is designed for a variety of downstream NLP tasks on Indian languages. It is also trained on transliterated data, which is common in the Indian context. However, it may not perform well on languages other than the 17 Indian languages used in pretraining.
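For masked word prediction on these languages, a typical quick check is a fill-mask query. The sketch below uses the Hugging Face `transformers` pipeline with the checkpoint name `google/muril-base-cased`, which is assumed here and should be verified on the model hub.

```python
# Sketch: masked word prediction with the MLM head via Hugging Face transformers.
# The checkpoint name is an assumption; verify it on the model hub.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="google/muril-base-cased")

# Hindi: "The capital of India is [MASK]."
for pred in fill_mask("भारत की राजधानी [MASK] है।", top_k=3):
    print(pred["token_str"], round(pred["score"], 3))
```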
Evaluation
We provide the results of fine-tuning this model on a set of downstream tasks from the XTREME benchmark [9], evaluated on Indian language test sets and their transliterated versions.
Results on XTREME benchmark datasets (in %)
| Task | Metric | Model | en | hi | mr | ta | te | ur | bn | ml | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|
| PANX | F1 | mBERT | 84.40 | 65.13 | 58.44 | 51.24 | 50.16 | 31.36 | 68.59 | 54.77 | 58.01 |
| PANX | F1 | MuRIL | 84.43 | 78.09 | 74.63 | 71.86 | 64.99 | 85.07 | 85.97 | 75.74 | 77.60 |
| UDPOS | F1 | mBERT | 95.35 | 66.09 | 71.27 | 59.58 | 76.98 | 57.85 | - | - | 71.19 |
| UDPOS | F1 | MuRIL | 95.55 | 64.47 | 82.95 | 62.57 | 85.63 | 58.93 | - | - | 75.02 |
| XNLI | Accuracy | mBERT | 81.72 | 60.52 | - | - | - | 58.20 | - | - | 66.81 |
| XNLI | Accuracy | MuRIL | 83.85 | 70.66 | - | - | - | 67.70 | - | - | 74.07 |
| Tatoeba | Accuracy | mBERT | - | 27.80 | 18.00 | 12.38 | 14.96 | 22.70 | 12.80 | 20.23 | 18.41 |
| Tatoeba | Accuracy | MuRIL | - | 31.50 | 26.60 | 36.81 | 17.52 | 17.10 | 20.20 | 26.35 | 25.15 |
| XQUAD | F1/EM | mBERT | 83.85/72.86 | 58.46/43.53 | - | - | - | - | - | - | 71.15/58.19 |
| XQUAD | F1/EM | MuRIL | 84.31/72.94 | 73.93/58.32 | - | - | - | - | - | - | 79.12/65.63 |
| MLQA | F1/EM | mBERT | 80.39/67.30 | 50.28/35.18 | - | - | - | - | - | - | 65.34/51.24 |
| MLQA | F1/EM | MuRIL | 80.28/67.37 | 67.34/50.22 | - | - | - | - | - | - | 73.81/58.80 |
| TyDiQA | F1/EM | mBERT | 75.21/65.00 | 60.62/45.13 | - | - | 53.55/44.54 | - | - | - | 63.13/51.66 |
| TyDiQA | F1/EM | MuRIL | 74.10/64.55 | 78.03/66.37 | - | - | 73.95/46.94 | - | - | - | 75.36/59.28 |
Results on transliterated test sets (in %)
| Task | Metric | Model | ml_tr | ta_tr | te_tr | bn_tr | hi_tr | mr_tr | ur_tr | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| PANX | F1 | mBERT | 7.53 | 1.04 | 8.24 | 41.77 | 25.46 | 8.34 | 7.30 | 14.24 |
| PANX | F1 | MuRIL | 63.39 | 7.00 | 53.62 | 72.94 | 69.75 | 68.77 | 68.41 | 57.70 |
| UDPOS | F1 | mBERT | - | - | - | - | 25.00 | 33.67 | 24.02 | 36.21 |
| UDPOS | F1 | MuRIL | - | - | - | - | 63.09 | 67.19 | 58.40 | 65.30 |
| XNLI | Accuracy | mBERT | - | - | - | - | 39.6 | - | 38.86 | 39.23 |
| XNLI | Accuracy | MuRIL | - | - | - | - | 68.24 | - | 61.16 | 64.70 |
| Tatoeba | Accuracy | mBERT | 2.18 | 1.95 | 5.13 | 1.80 | 3.00 | 2.40 | 2.30 | 2.68 |
| Tatoeba | Accuracy | MuRIL | 10.33 | 11.07 | 11.54 | 8.10 | 14.90 | 7.20 | 13.70 | 10.98 |
References
- [1]: Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805, 2018.
- [2]: Wikipedia
- [3]: [Common Crawl](http://commoncrawl.org/the-data/)
- [4]: [PMINDIA](http://lotus.kuee.kyoto-u.ac.jp/WAT/indic-multilingual/index.html)
- [5]: [Dakshina](https://github.com/google-research-datasets/dakshina)
- [6]: Assamese (as), Bengali (bn), English (en), Gujarati (gu), Hindi (hi), Kannada (kn), Kashmiri (ks), Malayalam (ml), Marathi (mr), Nepali (ne), Oriya (or), Punjabi (pa), Sanskrit (sa), Sindhi (sd), Tamil (ta), Telugu (te) and Urdu (ur).
- [7]: Conneau, Alexis, et al. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116, 2019.
- [8]: [IndicTrans](https://github.com/libindic/indic-trans)
- [9]: Hu, J., Ruder, S., Siddhant, A., Neubig, G., Firat, O., & Johnson, M. (2020). XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalization. arXiv preprint arXiv:2003.11080.
- [10]: Fang, Y., Wang, S., Gan, Z., Sun, S., & Liu, J. (2020). FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding. arXiv preprint arXiv:2009.05166.
Citation
If you find MuRIL useful in your applications, please cite the following paper:
```bibtex
@misc{khanuja2021muril,
      title={MuRIL: Multilingual Representations for Indian Languages},
      author={Simran Khanuja and Diksha Bansal and Sarvesh Mehtani and Savya Khosla and Atreyee Dey and Balaji Gopalan and Dilip Kumar Margam and Pooja Aggarwal and Rajiv Teja Nagipogu and Shachi Dave and Shruti Gupta and Subhash Chandra Bose Gali and Vish Subramanian and Partha Talukdar},
      year={2021},
      eprint={2103.10730},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
Contact
Please mail your queries/feedback to muril-contact@google.com.
License
This project is licensed under the Apache 2.0 license.

