🚀 CINO: Pre-trained Language Models for Chinese Minority Languages
CINO is a multilingual pre-trained language model that aims to address the lack of pre-trained models for Chinese minority languages. It enhances the XLM-R model with additional pre-training on Chinese minority language corpora, offering new possibilities for NLP research in these languages.
🚀 Quick Start
To learn more about CINO, please visit our GitHub repository (in Chinese): [https://github.com/ymcui/Chinese-Minority-PLM](https://github.com/ymcui/Chinese-Minority-PLM)
✨ Features
- Multilingual Support: CINO covers Chinese (zh) together with several minority languages and varieties spoken in China: Tibetan (bo), Mongolian (mn), Uyghur (ug), Kazakh (kk), Korean (ko), Zhuang, and Cantonese (yue).
- Enhanced XLM-R: CINO starts from XLM-R and is further pre-trained on Chinese minority language corpora, with the aim of improving language understanding for these languages.
📚 Documentation
Multilingual pre-trained language models (PLMs) such as mBERT and XLM-R offer multilingual and cross-lingual capabilities for language understanding, and progress in building them has been rapid in recent years. However, few contributions target Chinese minority languages, which makes it hard for researchers to develop powerful NLP systems for them.
To address this gap, the Joint Laboratory of HIT and iFLYTEK Research (HFL) proposes CINO. It is based on XLM-R and further pre-trained on corpora in the following languages:
- Chinese, 中文 (zh)
- Tibetan, 藏语 (bo)
- Mongolian (Uighur form), 蒙语 (mn)
- Uyghur, 维吾尔语 (ug)
- Kazakh (Arabic form), 哈萨克语 (kk)
- Korean, 朝鲜语 (ko)
- Zhuang, 壮语
- Cantonese, 粤语 (yue)
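Since CINO is based on XLM-R, one plausible way to try it is through the XLM-RoBERTa classes in Hugging Face Transformers. The sketch below is a minimal illustration under stated assumptions, not an official usage guide: the checkpoint name `hfl/cino-large` is not given in this README and is assumed here, the helper names `is_listed`, `load_cino`, and `embed` are hypothetical, and loading requires `transformers`, `torch`, and `sentencepiece` plus network access. The language table simply mirrors the list above (Zhuang is listed without a code, so it is omitted from the lookup).

```python
# Language codes exactly as listed in this README.
# Zhuang appears in the list without a code, so it has no entry here.
CINO_LANGUAGES = {
    "zh": "Chinese",
    "bo": "Tibetan",
    "mn": "Mongolian (Uighur form)",
    "ug": "Uyghur",
    "kk": "Kazakh (Arabic form)",
    "ko": "Korean",
    "yue": "Cantonese",
}


def is_listed(code: str) -> bool:
    """Return True if the language code appears in the README's list."""
    return code.lower() in CINO_LANGUAGES


def load_cino(checkpoint: str = "hfl/cino-large"):
    """Load a tokenizer and encoder for a CINO checkpoint.

    ASSUMPTION: the default checkpoint name is not stated in this README.
    Transformers is imported lazily so the helpers above work without it.
    """
    from transformers import XLMRobertaModel, XLMRobertaTokenizer

    tokenizer = XLMRobertaTokenizer.from_pretrained(checkpoint)
    model = XLMRobertaModel.from_pretrained(checkpoint)
    return tokenizer, model


def embed(text: str, tokenizer, model):
    """Encode one string and return the per-token hidden states."""
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model(**inputs)
    # Shape: (batch=1, sequence_length, hidden_size)
    return outputs.last_hidden_state
```

A call such as `tokenizer, model = load_cino()` followed by `embed("བོད་སྐད།", tokenizer, model)` would download the weights on first use. For downstream tasks, a task-specific head would normally be fine-tuned on top of the encoder.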
📄 License
This project is licensed under the Apache-2.0 license.
You may also be interested in the following related projects:
- Chinese MacBERT: https://github.com/ymcui/MacBERT
- Chinese BERT series: [https://github.com/ymcui/Chinese-BERT-wwm](https://github.com/ymcui/Chinese-BERT-wwm)
- Chinese ELECTRA: [https://github.com/ymcui/Chinese-ELECTRA](https://github.com/ymcui/Chinese-ELECTRA)
- Chinese XLNet: [https://github.com/ymcui/Chinese-XLNet](https://github.com/ymcui/Chinese-XLNet)
- Knowledge Distillation Toolkit - TextBrewer: https://github.com/airaria/TextBrewer
- More resources by HFL: [https://github.com/ymcui/HFL-Anthology](https://github.com/ymcui/HFL-Anthology)