🚀 MARBERT Model
MARBERT is a large-scale pre-trained masked language model focused on Dialectal Arabic (DA) and Modern Standard Arabic (MSA). It handles the many varieties of Arabic effectively and provides strong support for Arabic natural language processing tasks.
✨ Key Features
- Language coverage: supports Arabic, covering both Dialectal Arabic and Modern Standard Arabic.
- Large-scale pre-training: pre-trained on a large dataset of roughly 128GB of text (15.6B tokens).
- Architecture: uses the same network architecture as ARBERT (BERT-base), but removes the next sentence prediction (NSP) objective to suit the short nature of tweets (see the loading sketch after this list).
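The sketch below shows one way to load MARBERT with the Hugging Face transformers library and confirm that its configuration matches BERT-base. The model ID "UBC-NLP/MARBERT" is an assumption; check the model hub page for the exact identifier.

```python
# Minimal loading sketch, assuming the model is published as "UBC-NLP/MARBERT"
# on the Hugging Face Hub (verify the exact identifier on the hub page).
from transformers import AutoConfig, AutoModelForMaskedLM, AutoTokenizer

model_id = "UBC-NLP/MARBERT"  # assumed model identifier

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# The configuration should mirror BERT-base hyperparameters
# (12 layers, 768 hidden units, 12 attention heads).
config = AutoConfig.from_pretrained(model_id)
print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)
```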
📚 Documentation
Model Overview
MARBERT is one of the three models described in our ACL 2021 paper "ARBERT & MARBERT: Deep Bidirectional Transformers for Arabic". Arabic has many varieties. To train MARBERT, we randomly sampled 1 billion Arabic tweets from a large in-house dataset of about 6 billion tweets. We only included tweets with at least 3 Arabic words (based on character-string matching), regardless of whether a tweet also contained non-Arabic strings; that is, as long as a tweet met the 3-Arabic-word criterion, we did not remove its non-Arabic content. The resulting dataset comprises 128GB of text (15.6B tokens). We use the same network architecture as ARBERT (BERT-base), but remove the next sentence prediction (NSP) objective since tweets are short. See our repository for details on modifying the BERT code to remove NSP. For more information about MARBERT, visit our GitHub repository.
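As a quick usage illustration, the sketch below runs masked-token prediction with the transformers fill-mask pipeline. The model ID and the Arabic example sentence are illustrative assumptions, not taken from the paper.

```python
# Masked-token prediction sketch, assuming the "UBC-NLP/MARBERT" model ID.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="UBC-NLP/MARBERT")

# BERT-style models use the [MASK] token; the sentence below is only an example.
predictions = fill_mask("اللغة العربية [MASK] .")
for p in predictions:
    print(p["token_str"], round(p["score"], 3))
```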
Model Information

| Property | Details |
| --- | --- |
| Model type | Large-scale pre-trained masked language model |
| Training data | 1 billion Arabic tweets randomly sampled from a large in-house dataset of about 6 billion tweets, comprising 128GB of text (15.6B tokens) |
BibTeX Citation
If you use our models (ARBERT, MARBERT, or MARBERTv2) in a scientific publication, or if you find the resources in this repository useful, please cite our paper as follows (to be updated):
@inproceedings{abdul-mageed-etal-2021-arbert,
title = "{ARBERT} {\&} {MARBERT}: Deep Bidirectional Transformers for {A}rabic",
author = "Abdul-Mageed, Muhammad and
Elmadany, AbdelRahim and
Nagoudi, El Moatez Billah",
booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
month = aug,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.acl-long.551",
doi = "10.18653/v1/2021.acl-long.551",
pages = "7088--7105",
abstract = "Pre-trained language models (LMs) are currently integral to many natural language processing systems. Although multilingual LMs were also introduced to serve many languages, these have limitations such as being costly at inference time and the size and diversity of non-English data involved in their pre-training. We remedy these issues for a collection of diverse Arabic varieties by introducing two powerful deep bidirectional transformer-based models, ARBERT and MARBERT. To evaluate our models, we also introduce ARLUE, a new benchmark for multi-dialectal Arabic language understanding evaluation. ARLUE is built using 42 datasets targeting six different task clusters, allowing us to offer a series of standardized experiments under rich conditions. When fine-tuned on ARLUE, our models collectively achieve new state-of-the-art results across the majority of tasks (37 out of 48 classification tasks, on the 42 datasets). Our best model acquires the highest ARLUE score (77.40) across all six task clusters, outperforming all other models including XLM-R Large (~3.4x larger size). Our models are publicly available at https://github.com/UBC-NLP/marbert and ARLUE will be released through the same repository.",
}
🔗 Acknowledgements
We gratefully acknowledge support from the Natural Sciences and Engineering Research Council of Canada, the Social Sciences and Humanities Research Council of Canada, the Canadian Foundation for Innovation, ComputeCanada, and UBC ARC-Sockeye. We also thank the Google TensorFlow Research Cloud (TFRC) program for providing us with free TPU access.