ARBERT open-source language model: trained on a large volume of Arabic text to support Arabic applications

ARBERT

Developed by UBC-NLP

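ARBERT is a large-scale pretrained masked language model for Modern Standard Arabic (MSA). It is based on the BERT-base architecture and was trained on 61GB of Arabic text.

Large Language Model  Arabic  #Modern Standard Arabic pretraining  #Tweet text optimization  #Deep bidirectional Transformer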

Downloads 1,082

Release Time : 3/2/2022

Model Overview

ARBERT is a deep bidirectional Transformer model designed for Modern Standard Arabic. It handles masked language modeling on Arabic text.

Model Features

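Large-scale Arabic pretraining

Trained on 61GB of Arabic text (6.2 billion tokens) and optimized for Modern Standard Arabic

BERT-base compatible architecture

Uses the standard BERT-base architecture (12 layers / 12 heads / 768 hidden dimensions), which makes transfer learning and fine-tuning straightforward

Dedicated vocabulary

Includes an Arabic-specific vocabulary of 100K tokens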

Model Capabilities

Arabic text understanding

Masked language modeling

Text classification

Named entity recognition

Use Cases

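Social media analysis

Arabic tweet sentiment analysis

Determines the sentiment polarity of Arabic social media content

According to the paper's abstract, ARBERT and MARBERT collectively set new state-of-the-art results on 37 of 48 ARLUE classification tasks

Educational technology

Arabic grammar checking

Automatically detects grammatical errors in Modern Standard Arabic text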
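For the sentiment analysis use case, a fine-tuning setup might start from something like the sketch below. It is not taken from the model authors. The Hub identifier `UBC-NLP/ARBERT`, the three-way label set, and the helper names `load_classifier` and `logits_to_label` are all assumptions made for illustration. The classification head loaded this way is randomly initialized, so the model must be fine-tuned on labeled data before its predictions mean anything.

```python
import math

MODEL_ID = "UBC-NLP/ARBERT"  # assumed Hugging Face Hub identifier
LABELS = ["negative", "neutral", "positive"]  # hypothetical label set


def load_classifier(model_id=MODEL_ID, num_labels=len(LABELS)):
    """Load tokenizer and model; the classification head starts untrained."""
    # Imported lazily because transformers is a heavy dependency.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_id, num_labels=num_labels
    )
    return tokenizer, model


def logits_to_label(logits, labels=LABELS):
    """Convert one row of raw logits to (label, probability) via softmax."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]
```

After fine-tuning, the logits the model returns for a tweet can be passed row by row to `logits_to_label` to obtain a label and its probability.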

🚀 ARBERT

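ARBERT is one of three models described in our ACL 2021 paper "ARBERT & MARBERT: Deep Bidirectional Transformers for Arabic". It is a large-scale pretrained masked language model focused on Modern Standard Arabic (MSA). To train ARBERT, we use the same architecture as BERT-base: 12 attention layers, each with 12 attention heads and 768 hidden dimensions, a vocabulary of 100K WordPieces, and about 163 million parameters in total. We trained ARBERT on a collection of Arabic datasets comprising 61GB of text (6.2 billion tokens). For more information, see our GitHub repository (https://github.com/UBC-NLP/marbert).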
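As a rough sanity check, the parameter figure can be reproduced from the stated dimensions. The sketch below relies on values the text does not give: a vocabulary of exactly 100,000 entries, 512 position embeddings, 2 segment types, and a feed-forward width of 3,072 (the usual BERT-base settings). It counts the encoder and pooler only, not the masked language modeling head. The number of attention heads does not appear because splitting the hidden dimension into heads does not change the parameter count.

```python
def bert_param_count(vocab_size, hidden=768, layers=12,
                     intermediate=3072, max_positions=512, type_vocab=2):
    """Count trainable parameters of a BERT-style encoder with a pooler."""
    layer_norm = 2 * hidden  # scale and bias
    # Word, position and segment embeddings, followed by one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections (each with a bias) plus LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (hidden -> intermediate -> hidden) with biases plus LayerNorm.
    feed_forward = (hidden * intermediate + intermediate
                    + intermediate * hidden + hidden + layer_norm)
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler


arbert_params = bert_param_count(vocab_size=100_000)  # assumed exact vocab size
print(f"{arbert_params:,}")  # 162,841,344
```

Under these assumptions the count rounds to the roughly 163 million parameters quoted above. Nearly half of it (76.8 million) sits in the word embedding matrix because of the large vocabulary.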

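🚀 Quick Start

This model is useful for Arabic natural language processing tasks. Use the information below as a starting point.

Model Information

Property	Details
Model type	Pretrained masked language model
Training data	61GB of text (6.2 billion tokens)

Widget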

{
  "language": [
    "ar"
  ],
  "tags": [
    "Arabic BERT",
    "MSA",
    "Twitter",
    "Masked Language Model"
  ],
  "widget": [
    {
      "text": "اللغة العربية هي لغة [MASK]."
    }
  ]
}
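
The widget text in the metadata above can be tried locally with the Hugging Face `transformers` fill-mask pipeline. This is a minimal sketch, not an official usage snippet. The Hub identifier `UBC-NLP/ARBERT` is an assumption, and `top_tokens` and `predict_mask` are hypothetical helper names. Calling `predict_mask` downloads the model weights on first use.

```python
MODEL_ID = "UBC-NLP/ARBERT"  # assumed Hugging Face Hub identifier
PROMPT = "اللغة العربية هي لغة [MASK]."  # widget text from the metadata


def top_tokens(predictions, k=3):
    """Return the k highest-scoring (token, score) pairs from fill-mask output."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [(p["token_str"], round(p["score"], 4)) for p in ranked[:k]]


def predict_mask(text, model_id=MODEL_ID):
    """Run the fill-mask pipeline on text containing one [MASK] token."""
    # Imported lazily because transformers is a heavy dependency.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=model_id)
    return fill_mask(text)
```

With `transformers` and a backend such as PyTorch installed, `top_tokens(predict_mask(PROMPT))` returns the most likely replacements for the masked word together with their scores.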

📚 Documentation

BibTeX

If you use our models (ARBERT, MARBERT, or MARBERTv2) in a scientific publication, or if you find the resources in this repository useful, please cite our paper as follows (to be updated).

@inproceedings{abdul-mageed-etal-2021-arbert,
    title = "{ARBERT} {\&} {MARBERT}: Deep Bidirectional Transformers for {A}rabic",
    author = "Abdul-Mageed, Muhammad  and
      Elmadany, AbdelRahim  and
      Nagoudi, El Moatez Billah",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.acl-long.551",
    doi = "10.18653/v1/2021.acl-long.551",
    pages = "7088--7105",
    abstract = "Pre-trained language models (LMs) are currently integral to many natural language processing systems. Although multilingual LMs were also introduced to serve many languages, these have limitations such as being costly at inference time and the size and diversity of non-English data involved in their pre-training. We remedy these issues for a collection of diverse Arabic varieties by introducing two powerful deep bidirectional transformer-based models, ARBERT and MARBERT. To evaluate our models, we also introduce ARLUE, a new benchmark for multi-dialectal Arabic language understanding evaluation. ARLUE is built using 42 datasets targeting six different task clusters, allowing us to offer a series of standardized experiments under rich conditions. When fine-tuned on ARLUE, our models collectively achieve new state-of-the-art results across the majority of tasks (37 out of 48 classification tasks, on the 42 datasets). Our best model acquires the highest ARLUE score (77.40) across all six task clusters, outperforming all other models including XLM-R Large ( 3.4x larger size). Our models are publicly available at https://github.com/UBC-NLP/marbert and ARLUE will be released through the same repository.",
}

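Acknowledgements

We gratefully acknowledge support from the Natural Sciences and Engineering Research Council of Canada, the Social Sciences and Humanities Research Council of Canada, the Canada Foundation for Innovation, ComputeCanada, and UBC ARC-Sockeye. We also thank the Google TensorFlow Research Cloud (TFRC) program for providing free TPU access.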