🚀 ARBERTv2
ARBERTv2 is an updated version of the ARBERT model described in our ACL 2021 paper "ARBERT & MARBERT: Deep Bidirectional Transformers for Arabic". ARBERTv2 was introduced in our paper "ORCA: A Challenging Benchmark for Arabic Language Understanding". The model is trained on Modern Standard Arabic (MSA) data, comprising 243 GB of text and 27.8 billion tokens.
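A minimal loading sketch with 🤗 Transformers is shown below. It assumes the checkpoint is published on the Hugging Face Hub under the model ID `UBC-NLP/ARBERTv2`; that ID is an assumption and may differ from the actual repository name.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# NOTE: "UBC-NLP/ARBERTv2" is an assumed Hub model ID; adjust it to the actual checkpoint name.
MODEL_ID = "UBC-NLP/ARBERTv2"

# Load the tokenizer and the masked-language-model head of ARBERTv2.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID)
```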

✨ Key Features
- Language tags: Supports Arabic-related tasks, with tags covering Arabic BERT, Modern Standard Arabic (MSA), Twitter data, masked language modeling, and more.
- Widget example: A fill-mask widget is provided with the example text "اللغة [MASK] هي لغة العرب" (see the usage sketch after this list).
- Data scale: Trained on large-scale Modern Standard Arabic data, totaling 243 GB of text and 27.8 billion tokens.
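The widget example above can be reproduced with the `fill-mask` pipeline from 🤗 Transformers. This is a sketch, again assuming the `UBC-NLP/ARBERTv2` model ID:

```python
from transformers import pipeline

# Assumed model ID; see the note above.
fill_mask = pipeline("fill-mask", model="UBC-NLP/ARBERTv2")

# The widget example sentence: "اللغة [MASK] هي لغة العرب"
for prediction in fill_mask("اللغة [MASK] هي لغة العرب"):
    # Each prediction contains the filled-in token and its probability score.
    print(prediction["token_str"], round(prediction["score"], 4))
```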
📚 Documentation
Citation
If you use our model (ARBERTv2) in a scientific publication, or if you find the resources in this repository useful, please cite our papers as follows (to be updated):
@inproceedings{abdul-mageed-etal-2021-arbert,
    title = "{ARBERT} {\&} {MARBERT}: Deep Bidirectional Transformers for {A}rabic",
    author = "Abdul-Mageed, Muhammad and
      Elmadany, AbdelRahim and
      Nagoudi, El Moatez Billah",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.acl-long.551",
    doi = "10.18653/v1/2021.acl-long.551",
    pages = "7088--7105",
    abstract = "Pre-trained language models (LMs) are currently integral to many natural language processing systems. Although multilingual LMs were also introduced to serve many languages, these have limitations such as being costly at inference time and the size and diversity of non-English data involved in their pre-training. We remedy these issues for a collection of diverse Arabic varieties by introducing two powerful deep bidirectional transformer-based models, ARBERT and MARBERT. To evaluate our models, we also introduce ARLUE, a new benchmark for multi-dialectal Arabic language understanding evaluation. ARLUE is built using 42 datasets targeting six different task clusters, allowing us to offer a series of standardized experiments under rich conditions. When fine-tuned on ARLUE, our models collectively achieve new state-of-the-art results across the majority of tasks (37 out of 48 classification tasks, on the 42 datasets). Our best model acquires the highest ARLUE score (77.40) across all six task clusters, outperforming all other models including XLM-R Large ( 3.4x larger size). Our models are publicly available at https://github.com/UBC-NLP/marbert and ARLUE will be released through the same repository.",
}
@article{elmadany2022orca,
    title = {ORCA: A Challenging Benchmark for Arabic Language Understanding},
    author = {Elmadany, AbdelRahim and Nagoudi, El Moatez Billah and Abdul-Mageed, Muhammad},
    journal = {arXiv preprint arXiv:2212.10758},
    year = {2022}
}
Acknowledgments
We gratefully acknowledge support from the Natural Sciences and Engineering Research Council of Canada, the Social Sciences and Humanities Research Council of Canada, the Canada Foundation for Innovation, Compute Canada, and UBC ARC-Sockeye. We also thank the Google TensorFlow Research Cloud (TFRC) program for providing free access to TPUs.
📄 License
No license information is specified in this documentation. Please supplement this section if needed.