Google's mT5
mT5 is a multilingual text-to-text transformer pre-trained on a large-scale multilingual corpus, enabling it to handle various NLP tasks across 101 languages.
Quick Start
mT5 is pretrained on the mC4 corpus, which covers the following 101 languages:
Afrikaans, Albanian, Amharic, Arabic, Armenian, Azerbaijani, Basque, Belarusian, Bengali, Bulgarian, Burmese, Catalan, Cebuano, Chichewa, Chinese, Corsican, Czech, Danish, Dutch, English, Esperanto, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Haitian Creole, Hausa, Hawaiian, Hebrew, Hindi, Hmong, Hungarian, Icelandic, Igbo, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish, Kyrgyz, Lao, Latin, Latvian, Lithuanian, Luxembourgish, Macedonian, Malagasy, Malay, Malayalam, Maltese, Maori, Marathi, Mongolian, Nepali, Norwegian, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Samoan, Scottish Gaelic, Serbian, Shona, Sindhi, Sinhala, Slovak, Slovenian, Somali, Sotho, Spanish, Sundanese, Swahili, Swedish, Tajik, Tamil, Telugu, Thai, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, Welsh, West Frisian, Xhosa, Yiddish, Yoruba, Zulu.
Important Note
mT5 was pre-trained only on mC4, without any supervised training. It therefore has to be fine-tuned before it is usable on a downstream task.
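As a rough illustration of what that fine-tuning involves, the sketch below casts labeled examples into text-to-text pairs and runs a single gradient step with the Hugging Face `transformers` library. Several things here are assumptions rather than details from this card: the checkpoint name `google/mt5-small` (the card does not say which model size it describes), the `sentiment` task prefix, the learning rate, and the helper names `to_text_pair`, `mask_padding`, and `fine_tune_step`. It also assumes `torch`, `transformers`, and `sentencepiece` are installed.

```python
from typing import List, Sequence, Tuple

# Assumption: illustrative checkpoint name; this card does not state the model size.
CHECKPOINT = "google/mt5-small"


def to_text_pair(task_prefix: str, source: str, target: str) -> Tuple[str, str]:
    """Cast one labeled example into an (input text, target text) pair."""
    return f"{task_prefix}: {source.strip()}", target.strip()


def mask_padding(label_ids: List[List[int]], pad_id: int) -> List[List[int]]:
    """Replace padding ids with -100 so the cross-entropy loss ignores them."""
    return [[-100 if tok == pad_id else tok for tok in row] for row in label_ids]


def fine_tune_step(pairs: Sequence[Tuple[str, str]], lr: float = 1e-4) -> float:
    """Run one gradient step on a small batch and return the loss value."""
    # Imported lazily so the pure-Python helpers above work without these packages.
    import torch
    from transformers import AutoTokenizer, MT5ForConditionalGeneration

    tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
    model = MT5ForConditionalGeneration.from_pretrained(CHECKPOINT)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)  # lr is illustrative

    inputs, targets = zip(*pairs)
    batch = tokenizer(
        list(inputs),
        text_target=list(targets),  # tokenizes the targets into batch["labels"]
        padding=True,
        truncation=True,
        return_tensors="pt",
    )
    labels = torch.tensor(
        mask_padding(batch["labels"].tolist(), tokenizer.pad_token_id)
    )

    model.train()
    loss = model(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
        labels=labels,
    ).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

Calling `fine_tune_step([to_text_pair("sentiment", "Das Essen war ausgezeichnet.", "positive")])` downloads the checkpoint and performs one update. A real fine-tuning run would iterate over a full dataset for many steps and evaluate on held-out data.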
Documentation
Pretraining Dataset
The model is pretrained on the mC4 dataset.
Other Community Checkpoints
You can find other community checkpoints here.
Paper
The official paper is "mT5: A massively multilingual pre-trained text-to-text transformer".
Authors
The authors of the paper are Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel.
License
The model is released under the Apache-2.0 license.
Information Table
| Property | Details |
| --- | --- |
| Model Type | Google's mT5 |
| Training Data | mC4 |
| License | Apache-2.0 |
Technical Details
Abstract
The recent "Text-to-Text Transfer Transformer" (T5) leveraged a unified text-to-text format and scale to attain state-of-the-art results on a wide variety of English-language NLP tasks. In this paper, we introduce mT5, a multilingual variant of T5 that was pre-trained on a new Common Crawl-based dataset covering 101 languages. We describe the design and modified training of mT5 and demonstrate its state-of-the-art performance on many multilingual benchmarks. All of the code and model checkpoints used in this work are publicly available.