🚀 ARBERTv2
ARBERTv2 is an updated version of the ARBERT model described in our ACL 2021 paper "ARBERT & MARBERT: Deep Bidirectional Transformers for Arabic". ARBERTv2 was introduced in our paper "ORCA: A Challenging Benchmark for Arabic Language Understanding". The model is trained on Modern Standard Arabic (MSA) data, covering 243 GB of text and 27.8 billion tokens.
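Below is a minimal loading sketch using the Hugging Face `transformers` library. The model identifier `UBC-NLP/ARBERTv2` is an assumption based on the maintainers' organization and is not stated in this card; substitute the actual Hub ID if it differs.

```python
# Minimal loading sketch.
# Assumption: the checkpoint is published on the Hugging Face Hub under
# "UBC-NLP/ARBERTv2"; adjust the ID if the actual identifier differs.
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "UBC-NLP/ARBERTv2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
```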

✨ Key Features
- Language tags: supports Arabic-related tasks; tags cover Arabic BERT, Modern Standard Arabic (MSA), Twitter data, masked language modeling, and more.
- Widget example: a fill-mask widget is provided with the example text "اللغة [MASK] هي لغة العرب" (see the sketch after this list).
- Data scale: trained on large-scale Modern Standard Arabic data, amounting to 243 GB of text and 27.8 billion tokens.
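The widget example above can be reproduced locally with the `fill-mask` pipeline. This is a sketch under the same assumption as before, namely that the checkpoint is available on the Hub as `UBC-NLP/ARBERTv2`.

```python
# Fill-mask sketch reproducing the widget example from this card.
# Assumption: the Hub model ID is "UBC-NLP/ARBERTv2".
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="UBC-NLP/ARBERTv2")

# The [MASK] token in the example sentence is predicted by the model.
for prediction in fill_mask("اللغة [MASK] هي لغة العرب"):
    print(f"{prediction['token_str']}\t{prediction['score']:.4f}")
```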
📚 Documentation
Citation
If you use our model (ARBERTv2) in a scientific publication, or if you find the resources in this repository useful, please cite our papers as follows (to be updated):
@inproceedings{abdul-mageed-etal-2021-arbert,
    title = "{ARBERT} {\&} {MARBERT}: Deep Bidirectional Transformers for {A}rabic",
    author = "Abdul-Mageed, Muhammad and
      Elmadany, AbdelRahim and
      Nagoudi, El Moatez Billah",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.acl-long.551",
    doi = "10.18653/v1/2021.acl-long.551",
    pages = "7088--7105",
    abstract = "Pre-trained language models (LMs) are currently integral to many natural language processing systems. Although multilingual LMs were also introduced to serve many languages, these have limitations such as being costly at inference time and the size and diversity of non-English data involved in their pre-training. We remedy these issues for a collection of diverse Arabic varieties by introducing two powerful deep bidirectional transformer-based models, ARBERT and MARBERT. To evaluate our models, we also introduce ARLUE, a new benchmark for multi-dialectal Arabic language understanding evaluation. ARLUE is built using 42 datasets targeting six different task clusters, allowing us to offer a series of standardized experiments under rich conditions. When fine-tuned on ARLUE, our models collectively achieve new state-of-the-art results across the majority of tasks (37 out of 48 classification tasks, on the 42 datasets). Our best model acquires the highest ARLUE score (77.40) across all six task clusters, outperforming all other models including XLM-R Large ( 3.4x larger size). Our models are publicly available at https://github.com/UBC-NLP/marbert and ARLUE will be released through the same repository.",
}
@article{elmadany2022orca,
  title={ORCA: A Challenging Benchmark for Arabic Language Understanding},
  author={Elmadany, AbdelRahim and Nagoudi, El Moatez Billah and Abdul-Mageed, Muhammad},
  journal={arXiv preprint arXiv:2212.10758},
  year={2022}
}
Acknowledgements
We gratefully acknowledge the support of the Natural Sciences and Engineering Research Council of Canada, the Social Sciences and Humanities Research Council of Canada, the Canada Foundation for Innovation, ComputeCanada, and UBC ARC-Sockeye. We also thank the Google TensorFlow Research Cloud (TFRC) program for providing us with free TPU access.
📄 License
No license information is specified in this documentation.