🚀 ARBERTv2
ARBERTv2 is an updated version of the ARBERT model described in our ACL 2021 paper "ARBERT & MARBERT: Deep Bidirectional Transformers for Arabic". ARBERTv2 was introduced in our paper "ORCA: A Challenging Benchmark for Arabic Language Understanding". The model is trained on Modern Standard Arabic (MSA) data, covering 243 GB of text and 27.8 billion tokens.
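Below is a minimal loading sketch using the Hugging Face `transformers` library. The model identifier `UBC-NLP/ARBERTv2` is an assumption based on the maintainers' organization and is not stated in this card; substitute the actual Hub ID if it differs.

```python
# Minimal loading sketch.
# Assumption: the checkpoint is published on the Hugging Face Hub under
# "UBC-NLP/ARBERTv2"; adjust the ID if the actual identifier differs.
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "UBC-NLP/ARBERTv2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
```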

✨ Key Features
- Language tags: supports Arabic-related tasks; tags cover Arabic BERT, Modern Standard Arabic (MSA), Twitter data, masked language modeling, and more.
- Widget example: a fill-mask widget is provided with the example text "اللغة [MASK] هي لغة العرب" (see the sketch after this list).
- Data scale: trained on large-scale Modern Standard Arabic data, amounting to 243 GB of text and 27.8 billion tokens.
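The widget example above can be reproduced locally with the `fill-mask` pipeline. This is a sketch under the same assumption as before, namely that the checkpoint is available on the Hub as `UBC-NLP/ARBERTv2`.

```python
# Fill-mask sketch reproducing the widget example from this card.
# Assumption: the Hub model ID is "UBC-NLP/ARBERTv2".
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="UBC-NLP/ARBERTv2")

# The [MASK] token in the example sentence is predicted by the model.
for prediction in fill_mask("اللغة [MASK] هي لغة العرب"):
    print(f"{prediction['token_str']}\t{prediction['score']:.4f}")
```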
📚 Documentation
Citation
If you use our model (ARBERTv2) in a scientific publication, or if you find the resources in this repository useful, please cite our papers as follows (to be updated):
@inproceedings{abdul-mageed-etal-2021-arbert,
    title = "{ARBERT} {\&} {MARBERT}: Deep Bidirectional Transformers for {A}rabic",
    author = "Abdul-Mageed, Muhammad and
      Elmadany, AbdelRahim and
      Nagoudi, El Moatez Billah",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.acl-long.551",
    doi = "10.18653/v1/2021.acl-long.551",
    pages = "7088--7105",
    abstract = "Pre-trained language models (LMs) are currently integral to many natural language processing systems. Although multilingual LMs were also introduced to serve many languages, these have limitations such as being costly at inference time and the size and diversity of non-English data involved in their pre-training. We remedy these issues for a collection of diverse Arabic varieties by introducing two powerful deep bidirectional transformer-based models, ARBERT and MARBERT. To evaluate our models, we also introduce ARLUE, a new benchmark for multi-dialectal Arabic language understanding evaluation. ARLUE is built using 42 datasets targeting six different task clusters, allowing us to offer a series of standardized experiments under rich conditions. When fine-tuned on ARLUE, our models collectively achieve new state-of-the-art results across the majority of tasks (37 out of 48 classification tasks, on the 42 datasets). Our best model acquires the highest ARLUE score (77.40) across all six task clusters, outperforming all other models including XLM-R Large ( 3.4x larger size). Our models are publicly available at https://github.com/UBC-NLP/marbert and ARLUE will be released through the same repository.",
}
@article{elmadany2022orca,
  title={ORCA: A Challenging Benchmark for Arabic Language Understanding},
  author={Elmadany, AbdelRahim and Nagoudi, El Moatez Billah and Abdul-Mageed, Muhammad},
  journal={arXiv preprint arXiv:2212.10758},
  year={2022}
}
Acknowledgements
We gratefully acknowledge the support of the Natural Sciences and Engineering Research Council of Canada, the Social Sciences and Humanities Research Council of Canada, the Canada Foundation for Innovation, ComputeCanada, and UBC ARC-Sockeye. We also thank the Google TensorFlow Research Cloud (TFRC) program for providing us with free TPU access.
📄 License
No license information is specified in this documentation.