CAMeLBERT: A collection of pre-trained models for Arabic NLP tasks
CAMeLBERT is a collection of BERT models pre-trained on Arabic texts. It offers models of different sizes and variants, including those for Modern Standard Arabic (MSA), dialectal Arabic (DA), classical Arabic (CA), and a mixed model. Additionally, there are models pre-trained on scaled-down MSA datasets. This model card focuses on CAMeLBERT-MSA-sixteenth (bert-base-arabic-camelbert-msa-sixteenth), pre-trained on a sixteenth of the full MSA dataset.
Quick Start
You can use the released model for masked language modeling or next sentence prediction. It is mainly designed to be fine-tuned on NLP tasks such as NER, POS tagging, sentiment analysis, dialect identification, and poetry classification. The fine-tuning code is available [here](https://github.com/CAMeL-Lab/CAMeLBERT).
Usage Examples
Basic Usage
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='CAMeL-Lab/bert-base-arabic-camelbert-msa-sixteenth')
>>> unmasker("الهدف من الحياة هو [MASK] .")
[{'sequence': '[CLS] الهدف من الحياة هو التغيير. [SEP]',
  'score': 0.08320745080709457,
  'token': 7946,
  'token_str': 'التغيير'},
 {'sequence': '[CLS] الهدف من الحياة هو التعلم. [SEP]',
  'score': 0.04305094853043556,
  'token': 12554,
  'token_str': 'التعلم'},
 {'sequence': '[CLS] الهدف من الحياة هو العمل. [SEP]',
  'score': 0.0417640283703804,
  'token': 2854,
  'token_str': 'العمل'},
 {'sequence': '[CLS] الهدف من الحياة هو الحياة. [SEP]',
  'score': 0.041371218860149384,
  'token': 3696,
  'token_str': 'الحياة'},
 {'sequence': '[CLS] الهدف من الحياة هو المعرفة. [SEP]',
  'score': 0.039794355630874634,
  'token': 7344,
  'token_str': 'المعرفة'}]
Note: to download our models, you need transformers>=3.5.0. Otherwise, you can download the models manually.
Advanced Usage
Here is how to use this model to get the features of a given text in PyTorch:
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-msa-sixteenth')
model = AutoModel.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-msa-sixteenth')
text = "مرحبا يا عالم."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
And in TensorFlow:
from transformers import AutoTokenizer, TFAutoModel
tokenizer = AutoTokenizer.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-msa-sixteenth')
model = TFAutoModel.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-msa-sixteenth')
text = "مرحبا يا عالم."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
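Either call returns the encoder's hidden states for every token in the input. As a minimal follow-up sketch in PyTorch (mean pooling is an illustrative heuristic, not something prescribed by this model card), here is one common way to turn those features into a single sentence vector:
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-msa-sixteenth')
model = AutoModel.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-msa-sixteenth')

encoded_input = tokenizer("مرحبا يا عالم.", return_tensors='pt')
with torch.no_grad():
    outputs = model(**encoded_input)

# outputs[0] is the last hidden state, shape (batch_size, sequence_length, hidden_size=768).
last_hidden_state = outputs[0]
# Mean-pool over the token dimension to get one 768-dimensional vector for the sentence.
sentence_embedding = last_hidden_state.mean(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768])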
Features
- Multiple Variants: Offers models for different Arabic variants (MSA, DA, CA) and a mixed model.
- Scaled-down Models: Provides models pre-trained on scaled-down MSA datasets.
Installation
To use the models, you need transformers>=3.5.0 installed. You can install it using the following command:
pip install "transformers>=3.5.0"
Documentation
Model description
CAMeLBERT consists of BERT models pre-trained on Arabic texts. The details are described in the paper "The Interplay of Variant, Size, and Task Type in Arabic Pre-trained Language Models."
Property | Details |
---|---|
Model Type | BERT-based pre-trained models for Arabic NLP |
Training Data | See "Training data" section |
Intended uses
The model can be used for masked language modeling and next sentence prediction, and it can be fine-tuned on various NLP tasks.
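Masked language modeling is demonstrated in the Quick Start above; for next sentence prediction, here is a minimal, hedged sketch using the generic BertForNextSentencePrediction head from transformers (the two sentences are arbitrary placeholders, and the NSP head carries only its pre-training weights, with no task-specific fine-tuning):
import torch
from transformers import AutoTokenizer, BertForNextSentencePrediction

tokenizer = AutoTokenizer.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-msa-sixteenth')
model = BertForNextSentencePrediction.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-msa-sixteenth')

sentence_a = "مرحبا يا عالم."              # "Hello, world."
sentence_b = "الهدف من الحياة هو التعلم."  # "The goal of life is learning."

encoding = tokenizer(sentence_a, sentence_b, return_tensors='pt')
with torch.no_grad():
    logits = model(**encoding)[0]  # shape (1, 2): index 0 = "B follows A", index 1 = "B is random"

p_is_next = torch.softmax(logits, dim=-1)[0, 0].item()
print(f"P(sentence B follows sentence A) = {p_is_next:.3f}")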
Training data
- MSA (Modern Standard Arabic)
- The Arabic Gigaword Fifth Edition
- Abu El-Khair Corpus
- OSIAN corpus
- Arabic Wikipedia
- The unshuffled version of the Arabic [OSCAR corpus](https://oscar-corpus.com/)
Training procedure
We use [the original implementation](https://github.com/google-research/bert) released by Google for pre-training and follow the original English BERT model's hyperparameters, unless otherwise specified.
Preprocessing
- After extracting raw text from each corpus, we perform the following pre-processing:
- Remove invalid characters and normalize white spaces using utilities from [the original BERT implementation](https://github.com/google-research/bert/blob/eedf5716ce1268e56f0a50264a88cafad334ac61/tokenization.py#L286-L297).
- Remove lines without Arabic characters.
- Remove diacritics and kashida using [CAMeL Tools](https://github.com/CAMeL-Lab/camel_tools).
- Split each line into sentences using a heuristics-based sentence segmenter.
- Train a WordPiece tokenizer on the entire dataset (167 GB text) with a vocabulary size of 30,000 using HuggingFace's tokenizers (see the sketch after this list).
- Do not lowercase letters nor strip accents.
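The two steps above that are easiest to reproduce are the diacritic/kashida cleanup and the WordPiece training. The following is a hedged sketch only: it assumes CAMeL Tools and HuggingFace's tokenizers are installed, uses hypothetical file names, and does not claim to match the exact pipeline or tokenizer settings used for CAMeLBERT.
from camel_tools.utils.dediac import dediac_ar   # diacritic removal from CAMeL Tools
from tokenizers import BertWordPieceTokenizer

KASHIDA = '\u0640'  # Arabic tatweel character

def clean_line(line: str) -> str:
    """Remove diacritics and kashida, roughly mirroring the preprocessing above."""
    return dediac_ar(line).replace(KASHIDA, '')

# Clean a (hypothetical) raw corpus file line by line.
with open('corpus_raw.txt', encoding='utf-8') as fin, \
     open('corpus_clean.txt', 'w', encoding='utf-8') as fout:
    for line in fin:
        fout.write(clean_line(line))

# Train a cased, accent-preserving WordPiece tokenizer with a 30,000-token vocabulary.
wp_tokenizer = BertWordPieceTokenizer(lowercase=False, strip_accents=False)
wp_tokenizer.train(files=['corpus_clean.txt'], vocab_size=30000)
wp_tokenizer.save_model('.')  # writes vocab.txt to the current directory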
Pre-training
- The model was trained on a single cloud TPU (v3-8) for one million steps.
- The first 90,000 steps were trained with a batch size of 1,024, and the rest with a batch size of 256.
- The sequence length was limited to 128 tokens for 90% of the steps and 512 for the remaining 10%.
- We use whole word masking and a duplicate factor of 10.
- We set the maximum number of predictions per sequence to 20 for the dataset with a max sequence length of 128 tokens and to 80 for the dataset with a max sequence length of 512 tokens.
- We use a random seed of 12345, a masked language model probability of 0.15, and a short sequence probability of 0.1.
- The optimizer is Adam with a learning rate of 1e-4, \(\beta_{1}=0.9\) and \(\beta_{2}=0.999\), a weight decay of 0.01, learning rate warm-up for the first 10,000 steps, and linear decay of the learning rate afterwards (sketched below).
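For concreteness, here is a minimal sketch of the warm-up/linear-decay schedule described in the last bullet (illustrative only; the actual values come from the learning rate schedule implemented in the original BERT code):
def bert_learning_rate(step: int,
                       peak_lr: float = 1e-4,
                       warmup_steps: int = 10_000,
                       total_steps: int = 1_000_000) -> float:
    """Linear warm-up to peak_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

# Example values: ~5e-5 halfway through warm-up, 1e-4 at step 10,000, ~5e-5 around step 505,000.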
Evaluation results
We evaluate the pre-trained language models on five NLP tasks using 12 datasets. We use Hugging Face's transformers v3.1.0 along with PyTorch v1.5.1 for fine-tuning. The fine-tuning is done by adding a fully connected linear layer to the last hidden state, and we use the \(F_{1}\) score as the metric for all tasks. The fine-tuning code is available [here](https://github.com/CAMeL-Lab/CAMeLBERT).
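As a hedged illustration of that setup (a linear classification layer on top of the encoder, scored with \(F_{1}\)), here is a sketch built from transformers' generic sequence-classification head and scikit-learn's f1_score; it is not the authors' fine-tuning code, which is linked above, and the three-label sentiment setup and toy batch are placeholders:
import torch
from sklearn.metrics import f1_score
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = 'CAMeL-Lab/bert-base-arabic-camelbert-msa-sixteenth'
tokenizer = AutoTokenizer.from_pretrained(model_name)
# num_labels=3 is a placeholder (e.g. negative/neutral/positive for sentiment analysis).
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

# Toy batch; in practice these come from one of the fine-tuning datasets below.
texts = ["مرحبا يا عالم.", "الهدف من الحياة هو التعلم."]
labels = torch.tensor([2, 1])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
outputs = model(**batch, labels=labels)
loss, logits = outputs[0], outputs[1]  # cross-entropy loss and (batch_size, num_labels) logits
loss.backward()                        # an optimizer step would follow here during training

predictions = logits.argmax(dim=-1)
print(f1_score(labels.numpy(), predictions.detach().numpy(), average='macro'))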
Task | Dataset | Variant | Mix | CA | DA | MSA | MSA-1/2 | MSA-1/4 | MSA-1/8 | MSA-1/16 |
---|---|---|---|---|---|---|---|---|---|---|
NER | ANERcorp | MSA | 80.8% | 67.9% | 74.1% | 82.4% | 82.0% | 82.1% | 82.6% | 80.8% |
POS | PATB (MSA) | MSA | 98.1% | 97.8% | 97.7% | 98.3% | 98.2% | 98.3% | 98.2% | 98.2% |
 | ARZTB (EGY) | DA | 93.6% | 92.3% | 92.7% | 93.6% | 93.6% | 93.7% | 93.6% | 93.6% |
 | Gumar (GLF) | DA | 97.3% | 97.7% | 97.9% | 97.9% | 97.9% | 97.9% | 97.9% | 97.9% |
SA | ASTD | MSA | 76.3% | 69.4% | 74.6% | 76.9% | 76.0% | 76.8% | 76.7% | 75.3% |
 | ArSAS | MSA | 92.7% | 89.4% | 91.8% | 93.0% | 92.6% | 92.5% | 92.5% | 92.3% |
 | SemEval | MSA | 69.0% | 58.5% | 68.4% | 72.1% | 70.7% | 72.8% | 71.6% | 71.2% |
DID | MADAR-26 | DA | 62.9% | 61.9% | 61.8% | 62.6% | 62.0% | 62.8% | 62.0% | 62.2% |
 | MADAR-6 | DA | 92.5% | 91.5% | 92.2% | 91.9% | 91.8% | 92.2% | 92.1% | 92.0% |
 | MADAR-Twitter-5 | MSA | 75.7% | 71.4% | 74.2% | 77.6% | 78.5% | 77.3% | 77.7% | 76.2% |
 | NADI | DA | 24.7% | 17.3% | 20.1% | 24.9% | 24.6% | 24.6% | 24.9% | 23.8% |
Poetry | APCD | CA | 79.8% | 80.9% | 79.6% | 79.7% | 79.9% | 80.0% | 79.7% | 79.8% |
Results (Average)
 | Variant | Mix | CA | DA | MSA | MSA-1/2 | MSA-1/4 | MSA-1/8 | MSA-1/16 |
---|---|---|---|---|---|---|---|---|---|
Variant-wise-average[1] | MSA | 82.1% | 75.7% | 80.1% | 83.4% | 83.0% | 83.3% | 83.2% | 82.3% |
 | DA | 74.4% | 72.1% | 72.9% | 74.2% | 74.0% | 74.3% | 74.1% | 73.9% |
 | CA | 79.8% | 80.9% | 79.6% | 79.7% | 79.9% | 80.0% | 79.7% | 79.8% |
Macro-Average | ALL | 78.7% | 74.7% | 77.1% | 79.2% | 79.0% | 79.2% | 79.1% | 78.6% |
[1]: Variant-wise-average refers to average over a group of tasks in the same language variant.
Technical Details
- Pre-training implementation: We use [the original implementation](https://github.com/google-research/bert) released by Google.
- Hyperparameters: We follow the original English BERT model's hyperparameters, with specific settings for batch size, sequence length, masking, etc.
License
This project is licensed under the Apache-2.0 license.
Acknowledgements
This research was supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC).
Citation
@inproceedings{inoue-etal-2021-interplay,
title = "The Interplay of Variant, Size, and Task Type in {A}rabic Pre-trained Language Models",
author = "Inoue, Go and
Alhafni, Bashar and
Baimukan, Nurpeiis and
Bouamor, Houda and
Habash, Nizar",
booktitle = "Proceedings of the Sixth Arabic Natural Language Processing Workshop",
month = apr,
year = "2021",
address = "Kyiv, Ukraine (Online)",
publisher = "Association for Computational Linguistics",
abstract = "In this paper, we explore the effects of language variants, data sizes, and fine-tuning task types in Arabic pre-trained language models. To do so, we build three pre-trained language models across three variants of Arabic: Modern Standard Arabic (MSA), dialectal Arabic, and classical Arabic, in addition to a fourth language model which is pre-trained on a mix of the three. We also examine the importance of pre-training data size by building additional models that are pre-trained on a scaled-down set of the MSA variant. We compare our different models to each other, as well as to eight publicly available models by fine-tuning them on five NLP tasks spanning 12 datasets. Our results suggest that the variant proximity of pre-training data to fine-tuning data is more important than the pre-training data size. We exploit this insight in defining an optimized system selection model for the studied tasks.",
}

