Google's mT5
mT5 is a multilingual pre-trained text-to-text transformer. It was pre-trained on mC4, a large-scale Common Crawl-based corpus covering 101 languages, and reports state-of-the-art performance on many multilingual benchmarks.
Quick Start
mT5 was developed by Google. The code and model checkpoints are publicly available in the official GitHub repository.
⨠Features
Multilingual Support
mT5 is pre-trained on the mC4 corpus, which covers 101 languages:
Afrikaans, Albanian, Amharic, Arabic, Armenian, Azerbaijani, Basque, Belarusian, Bengali, Bulgarian, Burmese, Catalan, Cebuano, Chichewa, Chinese, Corsican, Czech, Danish, Dutch, English, Esperanto, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Haitian Creole, Hausa, Hawaiian, Hebrew, Hindi, Hmong, Hungarian, Icelandic, Igbo, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish, Kyrgyz, Lao, Latin, Latvian, Lithuanian, Luxembourgish, Macedonian, Malagasy, Malay, Malayalam, Maltese, Maori, Marathi, Mongolian, Nepali, Norwegian, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Samoan, Scottish Gaelic, Serbian, Shona, Sindhi, Sinhala, Slovak, Slovenian, Somali, Sotho, Spanish, Sundanese, Swahili, Swedish, Tajik, Tamil, Telugu, Thai, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, Welsh, West Frisian, Xhosa, Yiddish, Yoruba, Zulu.
Pretraining Strategy
mT5 was pre-trained only on mC4, without any supervised training. The model therefore needs to be fine-tuned before it can be used on a downstream task.
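As a rough illustration of what fine-tuning involves, the sketch below casts labeled examples into (input text, target text) pairs and runs a single gradient step. It is a minimal sketch under stated assumptions, not an official recipe: the helper names `to_text_pair` and `one_fine_tuning_step`, the `sentiment` task prefix, the learning rate, and the choice of the `google/mt5-small` checkpoint are all illustrative and do not come from this model card. The training function assumes the Hugging Face `transformers` library, PyTorch, and `sentencepiece` are installed.

```python
from typing import List, Tuple


def to_text_pair(task_prefix: str, source: str, target: str) -> Tuple[str, str]:
    """Cast one labeled example into an (input text, target text) pair.

    The task prefix is a hypothetical convention chosen for this sketch.
    """
    return f"{task_prefix}: {source.strip()}", target.strip()


def one_fine_tuning_step(
    pairs: List[Tuple[str, str]],
    checkpoint: str = "google/mt5-small",  # assumed checkpoint, pick the size you need
) -> float:
    """Run a single gradient step on `pairs` and return the loss value.

    Imports are local so the formatting helper above works without
    PyTorch or transformers. Calling this function downloads the checkpoint.
    """
    import torch
    from transformers import AutoTokenizer, MT5ForConditionalGeneration

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = MT5ForConditionalGeneration.from_pretrained(checkpoint)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # illustrative value

    inputs = [source for source, _ in pairs]
    targets = [target for _, target in pairs]
    batch = tokenizer(
        inputs,
        text_target=targets,  # tokenizes the targets into batch["labels"]
        padding=True,
        truncation=True,
        return_tensors="pt",
    )
    # Padding positions in the labels should not contribute to the loss.
    batch["labels"][batch["labels"] == tokenizer.pad_token_id] = -100

    model.train()
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()


# Hypothetical multilingual sentiment examples in text-to-text form.
EXAMPLES = [
    to_text_pair("sentiment", "Das Essen war ausgezeichnet.", "positive"),
    to_text_pair("sentiment", "El servicio fue muy lento.", "negative"),
]
```

Calling `one_fine_tuning_step(EXAMPLES)` performs one update; a real fine-tuning run would loop over a full dataset for many steps and evaluate on held-out data.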
Documentation
Pretraining Dataset
The model is pre-trained on the mC4 dataset.
Other Community Checkpoints
You can find other community checkpoints here.
Paper
The official paper is "mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer".
Authors
The paper was written by Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel.
Abstract
The recent "Text-to-Text Transfer Transformer" (T5) leveraged a unified text-to-text format and scale to attain state-of-the-art results on a wide variety of English-language NLP tasks. In this paper, we introduce mT5, a multilingual variant of T5 that was pre-trained on a new Common Crawl-based dataset covering 101 languages. We describe the design and modified training of mT5 and demonstrate its state-of-the-art performance on many multilingual benchmarks. All of the code and model checkpoints used in this work are publicly available.
License
This model is licensed under the Apache 2.0 license.
Information Table
| Property | Details |
|---|---|
| Supported Languages | 101 languages (full list under Multilingual Support) |
| Pretraining Dataset | mC4 |
| Model Type | Multilingual pre-trained text-to-text transformer |
| License | Apache 2.0 |
Important Note
mT5 was pre-trained only on mC4, with no supervised training. It must therefore be fine-tuned before it is usable on a downstream task.