Open-source CAMeLBERT-CA model: a free tool for processing Classical Arabic text

Bert Base Arabic Camelbert Ca

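Developed by CAMeL-Lab

CAMeLBERT is a collection of BERT models optimized for Arabic language variants; the CA version is pre-trained specifically on Classical Arabic text.

Large language model | Arabic | License: Apache-2.0 | #ClassicalArabicProcessing #MultiTaskFineTuning #ArabicNLP

Downloads: 1,128

Released: 3/2/2022

Model overview

A BERT model pre-trained on a Classical Arabic (CA) dataset, intended for fine-tuning on Arabic NLP tasks.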

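Model features

Optimized for Classical Arabic

Pre-trained on about 6GB of Classical Arabic text; it scores 80.9% F1 on poetry classification (APCD), the highest among the CAMeLBERT models.

Multi-task fine-tuning

Suited to fine-tuning on five Arabic NLP tasks (NER, POS tagging, sentiment analysis, dialect identification, and poetry classification), evaluated on 12 datasets.

Variant-aware processing

The tokenizer does not lowercase letters or strip accents, and pre-training uses whole word masking.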

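Model capabilities

Masked language modeling

Next sentence prediction

Named entity recognition

Part-of-speech tagging

Sentiment analysis

Dialect identification

Poetry classification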

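Use cases

Classical literature analysis

Arabic poetry classification

Automatically classify classical Arabic poetry

Reaches an 80.9% F1 score on the APCD dataset

Linguistic research

Classical text analysis

Analyze the linguistic features of Classical Arabic texts

Educational technology

Arabic learning support

Help learners understand Classical Arabic grammar and vocabulary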

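🚀 CAMeLBERT: A collection of pre-trained models for Arabic natural language processing tasks

CAMeLBERT is a collection of models pre-trained on Arabic texts of different sizes and variants. They can be used for a range of NLP tasks, such as named entity recognition, part-of-speech tagging, and sentiment analysis.

🚀 Quick start

You can use this model directly with a pipeline for masked language modeling: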

>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='CAMeL-Lab/bert-base-arabic-camelbert-ca')
>>> unmasker("الهدف من الحياة هو [MASK] .")
[{'sequence': '[CLS] الهدف من الحياة هو الحياة. [SEP]',
  'score': 0.11048116534948349,
  'token': 3696,
  'token_str': 'الحياة'},
 {'sequence': '[CLS] الهدف من الحياة هو الإسلام. [SEP]',
  'score': 0.03481195122003555,
  'token': 4677,
  'token_str': 'الإسلام'},
 {'sequence': '[CLS] الهدف من الحياة هو الموت. [SEP]',
  'score': 0.03402028977870941,
  'token': 4295,
  'token_str': 'الموت'},
 {'sequence': '[CLS] الهدف من الحياة هو العلم. [SEP]',
  'score': 0.027655426412820816,
  'token': 2789,
  'token_str': 'العلم'},
 {'sequence': '[CLS] الهدف من الحياة هو هذا. [SEP]',
  'score': 0.023059621453285217,
  'token': 2085,
  'token_str': 'هذا'}]

Note: To download our models, you need transformers>=3.5.0. Otherwise, you can download the models manually.
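
The pipeline returns a list of dictionaries like the one shown above. As a minimal sketch of how that output might be post-processed, the snippet below filters and ranks predictions by score. The `top_tokens` helper and the `min_score` threshold are hypothetical names introduced for illustration; the sample data is copied from the first three predictions above, so no model download is needed to run it.

```python
# First three predictions copied from the fill-mask output shown above.
predictions = [
    {"score": 0.11048116534948349, "token": 3696, "token_str": "الحياة"},
    {"score": 0.03481195122003555, "token": 4677, "token_str": "الإسلام"},
    {"score": 0.03402028977870941, "token": 4295, "token_str": "الموت"},
]


def top_tokens(results, min_score=0.0):
    """Return (token_str, score) pairs with score >= min_score, best first.

    Hypothetical helper: it only relies on the 'token_str' and 'score' keys
    that appear in the pipeline output above.
    """
    kept = [(r["token_str"], r["score"]) for r in results if r["score"] >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


best_token, best_score = top_tokens(predictions)[0]
print(best_token, round(best_score, 3))
```

Note how low the scores are: even the best candidate has a probability of about 0.11, so a threshold is mostly useful for discarding the long tail rather than for picking a single confident answer.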

Here is how to use this model to get the features of a given text in PyTorch:

from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-ca')
model = AutoModel.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-ca')
text = "مرحبا يا عالم."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)

And in TensorFlow:

from transformers import AutoTokenizer, TFAutoModel
tokenizer = AutoTokenizer.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-ca')
model = TFAutoModel.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-ca')
text = "مرحبا يا عالم."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)

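✨ Key features

Support for multiple language variants: pre-trained models are provided for Modern Standard Arabic (MSA), dialectal Arabic (DA), Classical Arabic (CA), and a mix of the three.
Different pre-training data sizes: in addition to the full-size models, additional models pre-trained on scaled-down subsets of the MSA data (half, quarter, eighth, and sixteenth) are provided.
Broad task coverage: the models can be used for masked language modeling and next sentence prediction, and are suited to fine-tuning on NLP tasks such as named entity recognition, part-of-speech tagging, sentiment analysis, dialect identification, and poetry classification.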

📦 Installation

To use these models, install the transformers library, version 3.5.0 or later:

pip install "transformers>=3.5.0"
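
To confirm that an existing environment satisfies the version requirement, a small check along the following lines may help. This is a sketch using only the standard library; `parse_version`, `meets_minimum`, and `installed_transformers_version` are hypothetical helper names, and the parser deliberately handles only simple dotted versions (pre-release suffixes are ignored).

```python
from importlib.metadata import PackageNotFoundError, version


def parse_version(text):
    """Turn '4.30.2' into (4, 30, 2); suffixes such as 'rc1' or '.dev0' are ignored."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    while len(parts) < 3:  # pad so that '3.5' compares equal to '3.5.0'
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed, minimum="3.5.0"):
    """True when the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


def installed_transformers_version():
    """Return the installed transformers version string, or None if it is absent."""
    try:
        return version("transformers")
    except PackageNotFoundError:
        return None


found = installed_transformers_version()
if found is None:
    print("transformers is not installed")
else:
    print(found, "OK" if meets_minimum(found) else "too old, need >= 3.5.0")
```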

💻 Usage examples

Basic usage

from transformers import pipeline
unmasker = pipeline('fill-mask', model='CAMeL-Lab/bert-base-arabic-camelbert-ca')
result = unmasker("الهدف من الحياة هو [MASK] .")
print(result)

Advanced usage

# Get the features of a text in PyTorch
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-ca')
model = AutoModel.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-ca')
text = "مرحبا يا عالم."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
# The output can be processed further, e.g. to extract features

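📚 Detailed documentation

Model description

CAMeLBERT is a collection of BERT models pre-trained on Arabic texts of different sizes and variants. We release pre-trained language models for Modern Standard Arabic (MSA), dialectal Arabic (DA), and Classical Arabic (CA), plus a model pre-trained on a mix of the three. We also provide additional models pre-trained on scaled-down subsets of the MSA data. For details, see the paper "The Interplay of Variant, Size, and Task Type in Arabic Pre-trained Language Models".

This model card describes CAMeLBERT-CA (bert-base-arabic-camelbert-ca), a model pre-trained on the Classical Arabic (CA) dataset.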

| Property | Details |
|---|---|
| Model type | `bert-base-arabic-camelbert-ca` |
| Training data | CA (Classical Arabic): [OpenITI (Version 2020.1.2)](https://zenodo.org/record/3891466#.YEX4-F0zbzc) |

Details of each model (✔ marks the model described in this card):

| | Model | Variant | Size | #Words |
|---|---|---|---|---|
| | `bert-base-arabic-camelbert-mix` | CA,DA,MSA | 167GB | 17.3B |
| ✔ | `bert-base-arabic-camelbert-ca` | CA | 6GB | 847M |
| | `bert-base-arabic-camelbert-da` | DA | 54GB | 5.8B |
| | `bert-base-arabic-camelbert-msa` | MSA | 107GB | 12.6B |
| | `bert-base-arabic-camelbert-msa-half` | MSA | 53GB | 6.3B |
| | `bert-base-arabic-camelbert-msa-quarter` | MSA | 27GB | 3.1B |
| | `bert-base-arabic-camelbert-msa-eighth` | MSA | 14GB | 1.6B |
| | `bert-base-arabic-camelbert-msa-sixteenth` | MSA | 6GB | 746M |
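
Since the models differ mainly in the Arabic variant of their pre-training data, one simple way to choose among them is to match the variant of the fine-tuning data. The sketch below illustrates that idea; the `model_for_variant` helper and the fallback to the mix model for unknown variants are assumptions made for illustration, not a rule stated in this card. The `CAMeL-Lab/` prefix follows the identifier used in the quick-start code.

```python
# Model names taken from the table above; the hub prefix follows the quick-start code.
HUB_PREFIX = "CAMeL-Lab/"
MODEL_BY_VARIANT = {
    "CA": "bert-base-arabic-camelbert-ca",
    "DA": "bert-base-arabic-camelbert-da",
    "MSA": "bert-base-arabic-camelbert-msa",
    "MIX": "bert-base-arabic-camelbert-mix",
}


def model_for_variant(variant):
    """Map a variant label ('CA', 'DA', 'MSA') to a hub model identifier.

    Assumption: unknown or mixed-variant data falls back to the mix model.
    """
    key = variant.strip().upper()
    name = MODEL_BY_VARIANT.get(key, MODEL_BY_VARIANT["MIX"])
    return HUB_PREFIX + name


print(model_for_variant("ca"))
print(model_for_variant("unknown"))
```

The returned string can be passed as the `model` argument in the quick-start snippets.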

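Intended uses

You can use the released model for either masked language modeling or next sentence prediction. However, it is mainly intended to be fine-tuned on an NLP task, such as named entity recognition (NER), part-of-speech (POS) tagging, sentiment analysis, dialect identification, or poetry classification. Our fine-tuning code is available [here](https://github.com/CAMeL-Lab/CAMeLBERT).

Training data

CA (Classical Arabic): [OpenITI (Version 2020.1.2)](https://zenodo.org/record/3891466#.YEX4-F0zbzc)

Training procedure

We pre-train with the [original implementation](https://github.com/google-research/bert) released by Google. Unless otherwise noted, we follow the hyperparameters of the original English BERT model.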

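Preprocessing

After extracting the raw text from each corpus, we apply the following preprocessing steps:
- First, remove invalid characters and normalize whitespace using the utilities provided by [the original BERT implementation](https://github.com/google-research/bert/blob/eedf5716ce1268e56f0a50264a88cafad334ac61/tokenization.py#L286-L297).
- Remove lines that contain no Arabic characters.
- Remove diacritics and kashida using [CAMeL Tools](https://github.com/CAMeL-Lab/camel_tools).
- Finally, split each line into sentences with a heuristics-based sentence segmenter.
- Train a WordPiece tokenizer with a vocabulary size of 30,000 on the entire dataset (167GB of text) using HuggingFace's tokenizers.
- Do not lowercase letters or strip accents.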
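
The line-level cleaning steps above (whitespace normalization, dropping lines without Arabic characters, removing diacritics and kashida) can be approximated with the standard library alone. The following is only an illustrative sketch: it does not use CAMeL Tools or the BERT utilities, and the Unicode ranges chosen for "Arabic letter" and "diacritic" are assumptions that may differ from what those tools actually cover.

```python
import re

# Assumed character classes (not taken from CAMeL Tools):
# Arabic letters U+0621..U+063A and U+0641..U+064A (this skips kashida, U+0640).
ARABIC_LETTER = re.compile(r"[\u0621-\u063A\u0641-\u064A]")
# Tanween, short vowels, shadda, sukun (U+064B..U+0652) and superscript alef (U+0670).
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")
KASHIDA = "\u0640"


def clean_line(line):
    """Clean one line; return None if it contains no Arabic letters."""
    line = " ".join(line.split())  # collapse runs of whitespace
    if not ARABIC_LETTER.search(line):
        return None
    line = DIACRITICS.sub("", line)
    return line.replace(KASHIDA, "")


def clean_corpus(lines):
    """Apply clean_line to every line and drop the rejected ones."""
    cleaned = (clean_line(line) for line in lines)
    return [line for line in cleaned if line]


sample = ["Hello, world.", "مَرْحَبًا   يا عـــالم."]
print(clean_corpus(sample))
```

Sentence segmentation and tokenizer training are left out because the card does not describe the heuristics used.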

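Pre-training

The model was trained on a single cloud TPU (v3-8) for one million steps in total.
The first 90,000 steps used a batch size of 1,024; the remaining steps used a batch size of 256.
The sequence length was limited to 128 tokens for 90% of the steps and to 512 tokens for the remaining 10%.
We use whole word masking and a duplicate factor of 10.
The maximum number of predictions per sequence is set to 20 for the dataset with a maximum sequence length of 128 tokens and to 80 for the dataset with a maximum sequence length of 512 tokens.
We use a random seed of 12345, a masked language model probability of 0.15, and a short sequence probability of 0.1.
The optimizer is Adam with a learning rate of 1e-4, \(\beta_{1} = 0.9\), \(\beta_{2} = 0.999\), and a weight decay of 0.01; the learning rate is warmed up over 10,000 steps and decays linearly afterwards.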
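
The warmup-then-linear-decay schedule can be written out as a small function. The sketch below uses the numbers from the text (peak rate 1e-4, 10,000 warmup steps, one million total steps) and assumes the same shape as the schedule in Google's original BERT optimizer, where the linear decay is measured from step 0 and the warmup ramp simply overrides it during the first 10,000 steps. The function name and the exact shape are assumptions for illustration.

```python
PEAK_LR = 1e-4           # learning rate from the text
WARMUP_STEPS = 10_000    # warmup length from the text
TOTAL_STEPS = 1_000_000  # total training steps from the text


def learning_rate(step):
    """Learning rate at a given step under linear warmup and linear decay.

    Assumed shape: the decay line runs from PEAK_LR at step 0 to 0 at
    TOTAL_STEPS; during warmup the rate instead ramps linearly from 0.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(0, TOTAL_STEPS - step)
    return PEAK_LR * remaining / TOTAL_STEPS


for step in (0, 5_000, 10_000, 500_000, 1_000_000):
    print(step, learning_rate(step))
```

Under this assumption the rate never quite reaches 1e-4: right after warmup it is 0.99 times the peak, because 1% of the decay has already elapsed.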

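Evaluation results

We evaluate the pre-trained language models on five NLP tasks: named entity recognition (NER), part-of-speech (POS) tagging, sentiment analysis (SA), dialect identification (DID), and poetry classification.
We fine-tune and evaluate the models on 12 datasets.
We fine-tune the CAMeLBERT models with Hugging Face's transformers library.
We use transformers v3.1.0 with PyTorch v1.5.1.
Fine-tuning is done by adding a fully connected linear layer on top of the last hidden state.
We use the \(F_{1}\) score as the evaluation metric for all tasks.
The fine-tuning code is available [here](https://github.com/CAMeL-Lab/CAMeLBERT).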
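
For reference, the \(F_{1}\) score is the harmonic mean of precision and recall. The card does not say which averaging (per-class, micro, macro, or entity-level) is used for each task, so the following is only a generic sketch with hypothetical helper names, showing a per-class score and an unweighted macro average over classes.

```python
def f1_score(gold, predicted, positive):
    """F1 for one class: harmonic mean of precision and recall."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, predicted):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(gold) | set(predicted))
    return sum(f1_score(gold, predicted, label) for label in labels) / len(labels)


gold = ["pos", "neg", "pos", "neg"]
pred = ["pos", "pos", "pos", "neg"]
print(f1_score(gold, pred, "pos"), macro_f1(gold, pred))
```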

Results

| Task | Dataset | Variant | Mix | CA | DA | MSA | MSA-1/2 | MSA-1/4 | MSA-1/8 | MSA-1/16 |
|---|---|---|---|---|---|---|---|---|---|---|
| NER | ANERcorp | MSA | 80.8% | 67.9% | 74.1% | 82.4% | 82.0% | 82.1% | 82.6% | 80.8% |
| POS | PATB (MSA) | MSA | 98.1% | 97.8% | 97.7% | 98.3% | 98.2% | 98.3% | 98.2% | 98.2% |
| | ARZTB (EGY) | DA | 93.6% | 92.3% | 92.7% | 93.6% | 93.6% | 93.7% | 93.6% | 93.6% |
| | Gumar (GLF) | DA | 97.3% | 97.7% | 97.9% | 97.9% | 97.9% | 97.9% | 97.9% | 97.9% |
| SA | ASTD | MSA | 76.3% | 69.4% | 74.6% | 76.9% | 76.0% | 76.8% | 76.7% | 75.3% |
| | ArSAS | MSA | 92.7% | 89.4% | 91.8% | 93.0% | 92.6% | 92.5% | 92.5% | 92.3% |
| | SemEval | MSA | 69.0% | 58.5% | 68.4% | 72.1% | 70.7% | 72.8% | 71.6% | 71.2% |
| DID | MADAR-26 | DA | 62.9% | 61.9% | 61.8% | 62.6% | 62.0% | 62.8% | 62.0% | 62.2% |
| | MADAR-6 | DA | 92.5% | 91.5% | 92.2% | 91.9% | 91.8% | 92.2% | 92.1% | 92.0% |
| | MADAR-Twitter-5 | MSA | 75.7% | 71.4% | 74.2% | 77.6% | 78.5% | 77.3% | 77.7% | 76.2% |
| | NADI | DA | 24.7% | 17.3% | 20.1% | 24.9% | 24.6% | 24.6% | 24.9% | 23.8% |
| Poetry | APCD | CA | 79.8% | 80.9% | 79.6% | 79.7% | 79.9% | 80.0% | 79.7% | 79.8% |

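Results (averages)

| | Variant | Mix | CA | DA | MSA | MSA-1/2 | MSA-1/4 | MSA-1/8 | MSA-1/16 |
|---|---|---|---|---|---|---|---|---|---|
| Variant-wise average [1] | MSA | 82.1% | 75.7% | 80.1% | 83.4% | 83.0% | 83.3% | 83.2% | 82.3% |
| | DA | 74.4% | 72.1% | 72.9% | 74.2% | 74.0% | 74.3% | 74.1% | 73.9% |
| | CA | 79.8% | 80.9% | 79.6% | 79.7% | 79.9% | 80.0% | 79.7% | 79.8% |
| Macro average | ALL | 78.7% | 74.7% | 77.1% | 79.2% | 79.0% | 79.2% | 79.1% | 78.6% |

[1]: The variant-wise average is the average over the group of tasks in the same language variant.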
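
To make the averaging concrete, the sketch below recomputes the averages for the CA model from its per-dataset scores in the results table. The helper names are hypothetical. Recomputing suggests that the macro average is a plain mean over all 12 datasets rather than a mean of the three variant-wise averages.

```python
# F1 scores (%) of the CA model from the results table: (dataset, variant, score).
CA_MODEL_RESULTS = [
    ("ANERcorp", "MSA", 67.9),
    ("PATB (MSA)", "MSA", 97.8),
    ("ARZTB (EGY)", "DA", 92.3),
    ("Gumar (GLF)", "DA", 97.7),
    ("ASTD", "MSA", 69.4),
    ("ArSAS", "MSA", 89.4),
    ("SemEval", "MSA", 58.5),
    ("MADAR-26", "DA", 61.9),
    ("MADAR-6", "DA", 91.5),
    ("MADAR-Twitter-5", "MSA", 71.4),
    ("NADI", "DA", 17.3),
    ("APCD", "CA", 80.9),
]


def variant_average(results, variant):
    """Mean score over the datasets that belong to one language variant."""
    scores = [score for _, v, score in results if v == variant]
    return sum(scores) / len(scores)


def macro_average(results):
    """Mean score over all datasets."""
    return sum(score for _, _, score in results) / len(results)


for variant in ("MSA", "DA", "CA"):
    print(variant, round(variant_average(CA_MODEL_RESULTS, variant), 1))
print("ALL", round(macro_average(CA_MODEL_RESULTS), 1))
# prints 75.7, 72.1, 80.9 and 74.7, matching the CA column of the averages table
```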

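🔧 Technical details

Pre-training implementation

Pre-training uses the [original implementation](https://github.com/google-research/bert) released by Google and follows the hyperparameters of the original English BERT model unless otherwise noted.

Data processing

The raw text goes through several preprocessing steps: removing invalid characters, normalizing whitespace, removing lines without Arabic characters, removing diacritics and kashida, splitting lines into sentences, and training a WordPiece tokenizer.

Pre-training parameters

Training runs on a single cloud TPU (v3-8), with the batch size and sequence length changing over the course of training, whole word masking, and the Adam optimizer settings listed above.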
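
The batch-size and sequence-length settings given in the pre-training section can be turned into a few quick calculations. The sketch below uses only numbers stated there; the function names are hypothetical, and the masking budget formula (round the sequence length times 0.15, then cap at the stated maximum) is an assumption based on how the original BERT data-generation script is commonly described. How the batch-size phases line up with the sequence-length phases is not stated, so they are treated separately.

```python
TOTAL_STEPS = 1_000_000      # total training steps
LARGE_BATCH_STEPS = 90_000   # steps trained with batch size 1,024
MASKED_LM_PROB = 0.15
MAX_PREDICTIONS = {128: 20, 512: 80}  # cap per maximum sequence length


def batch_size(step):
    """Batch size at a 0-indexed training step."""
    return 1024 if step < LARGE_BATCH_STEPS else 256


def total_sequences():
    """Number of training sequences consumed over the whole run."""
    small_batch_steps = TOTAL_STEPS - LARGE_BATCH_STEPS
    return LARGE_BATCH_STEPS * 1024 + small_batch_steps * 256


def steps_per_sequence_length():
    """Steps at 128 tokens (90%) and at 512 tokens (remaining 10%)."""
    short_steps = TOTAL_STEPS * 9 // 10
    return {128: short_steps, 512: TOTAL_STEPS - short_steps}


def masked_positions(seq_len):
    """Assumed masking budget for a full-length sequence."""
    wanted = max(1, int(round(seq_len * MASKED_LM_PROB)))
    return min(MAX_PREDICTIONS[seq_len], wanted)


print(total_sequences())
print(steps_per_sequence_length())
print(masked_positions(128), masked_positions(512))
```

Under these assumptions the caps of 20 and 80 sit just above the 15% budget for full-length sequences (19 and 77 positions), so they would rarely bind.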

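📄 License

This project is released under the Apache-2.0 license.

Acknowledgements

This research was supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC).

Citation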

@inproceedings{inoue-etal-2021-interplay,
    title = "The Interplay of Variant, Size, and Task Type in {A}rabic Pre-trained Language Models",
    author = "Inoue, Go  and
      Alhafni, Bashar  and
      Baimukan, Nurpeiis  and
      Bouamor, Houda  and
      Habash, Nizar",
    booktitle = "Proceedings of the Sixth Arabic Natural Language Processing Workshop",
    month = apr,
    year = "2021",
    address = "Kyiv, Ukraine (Online)",
    publisher = "Association for Computational Linguistics",
    abstract = "In this paper, we explore the effects of language variants, data sizes, and fine-tuning task types in Arabic pre-trained language models. To do so, we build three pre-trained language models across three variants of Arabic: Modern Standard Arabic (MSA), dialectal Arabic, and classical Arabic, in addition to a fourth language model which is pre-trained on a mix of the three. We also examine the importance of pre-training data size by building additional models that are pre-trained on a scaled-down set of the MSA variant. We compare our different models to each other, as well as to eight publicly available models by fine-tuning them on five NLP tasks spanning 12 datasets. Our results suggest that the variant proximity of pre-training data to fine-tuning data is more important than the pre-training data size. We exploit this insight in defining an optimized system selection model for the studied tasks.",
}