🚀 ARBERTv2
ARBERTv2 is an updated version of the ARBERT model described in our ACL 2021 paper "ARBERT & MARBERT: Deep Bidirectional Transformers for Arabic". ARBERTv2 was introduced in our paper "ORCA: A Challenging Benchmark for Arabic Language Understanding". The model is trained on Modern Standard Arabic (MSA) data, comprising 243 GB of text and 27.8 billion tokens.
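A minimal loading sketch with 🤗 Transformers is shown below. It assumes the checkpoint is published on the Hugging Face Hub under the model ID `UBC-NLP/ARBERTv2`; that ID is an assumption and may differ from the actual repository name.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# NOTE: "UBC-NLP/ARBERTv2" is an assumed Hub model ID; adjust it to the actual checkpoint name.
MODEL_ID = "UBC-NLP/ARBERTv2"

# Load the tokenizer and the masked-language-model head of ARBERTv2.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID)
```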

✨ Key Features
- Language tags: Supports Arabic-related tasks, with tags covering Arabic BERT, Modern Standard Arabic (MSA), Twitter data, masked language modeling, and more.
- Widget example: A fill-mask widget is provided with the example text "اللغة [MASK] هي لغة العرب" (see the usage sketch after this list).
- Data scale: Trained on large-scale Modern Standard Arabic data, totaling 243 GB of text and 27.8 billion tokens.
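The widget example above can be reproduced with the `fill-mask` pipeline from 🤗 Transformers. This is a sketch, again assuming the `UBC-NLP/ARBERTv2` model ID:

```python
from transformers import pipeline

# Assumed model ID; see the note above.
fill_mask = pipeline("fill-mask", model="UBC-NLP/ARBERTv2")

# The widget example sentence: "اللغة [MASK] هي لغة العرب"
for prediction in fill_mask("اللغة [MASK] هي لغة العرب"):
    # Each prediction contains the filled-in token and its probability score.
    print(prediction["token_str"], round(prediction["score"], 4))
```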
📚 Documentation
Citation
If you use our model (ARBERTv2) in a scientific publication, or if you find the resources in this repository useful, please cite our papers as follows (to be updated):
@inproceedings{abdul-mageed-etal-2021-arbert,
    title = "{ARBERT} {\&} {MARBERT}: Deep Bidirectional Transformers for {A}rabic",
    author = "Abdul-Mageed, Muhammad and
      Elmadany, AbdelRahim and
      Nagoudi, El Moatez Billah",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.acl-long.551",
    doi = "10.18653/v1/2021.acl-long.551",
    pages = "7088--7105",
    abstract = "Pre-trained language models (LMs) are currently integral to many natural language processing systems. Although multilingual LMs were also introduced to serve many languages, these have limitations such as being costly at inference time and the size and diversity of non-English data involved in their pre-training. We remedy these issues for a collection of diverse Arabic varieties by introducing two powerful deep bidirectional transformer-based models, ARBERT and MARBERT. To evaluate our models, we also introduce ARLUE, a new benchmark for multi-dialectal Arabic language understanding evaluation. ARLUE is built using 42 datasets targeting six different task clusters, allowing us to offer a series of standardized experiments under rich conditions. When fine-tuned on ARLUE, our models collectively achieve new state-of-the-art results across the majority of tasks (37 out of 48 classification tasks, on the 42 datasets). Our best model acquires the highest ARLUE score (77.40) across all six task clusters, outperforming all other models including XLM-R Large ( 3.4x larger size). Our models are publicly available at https://github.com/UBC-NLP/marbert and ARLUE will be released through the same repository.",
}
@article{elmadany2022orca,
    title = {ORCA: A Challenging Benchmark for Arabic Language Understanding},
    author = {Elmadany, AbdelRahim and Nagoudi, El Moatez Billah and Abdul-Mageed, Muhammad},
    journal = {arXiv preprint arXiv:2212.10758},
    year = {2022}
}
Acknowledgments
We gratefully acknowledge support from the Natural Sciences and Engineering Research Council of Canada, the Social Sciences and Humanities Research Council of Canada, the Canada Foundation for Innovation, Compute Canada, and UBC ARC-Sockeye. We also thank the Google TensorFlow Research Cloud (TFRC) program for providing free access to TPUs.
📄 License
No license information is specified in this documentation. Please supplement this section if needed.