🚀 CAMeLBERT: A collection of pre-trained models for Arabic NLP tasks
CAMeLBERT is a collection of BERT models pre-trained on Arabic texts, available in different sizes and variants. It offers pre-trained language models for Modern Standard Arabic (MSA), dialectal Arabic (DA), and classical Arabic (CA), along with a model pre-trained on a mix of these three. Additionally, there are models pre-trained on scaled-down sets of the MSA variant (half, quarter, eighth, and sixteenth). The details are described in the paper "The Interplay of Variant, Size, and Task Type in Arabic Pre-trained Language Models."
This model card focuses on CAMeLBERT-MSA (`bert-base-arabic-camelbert-msa`), a model pre-trained on the entire MSA dataset.
🚀 Quick Start
Model Description
CAMeLBERT consists of a series of BERT models pre-trained on Arabic texts. The models vary in size and variant, covering different forms of Arabic such as Modern Standard Arabic, dialectal Arabic, and classical Arabic.
| Property | Details |
|---|---|
| Model Type | A collection of pre-trained BERT models for Arabic NLP tasks |
| Training Data | See the "Training Data" section below |
Intended Uses
The released models can be used for masked language modeling or next sentence prediction. However, they are mainly designed to be fine-tuned on various NLP tasks, including Named Entity Recognition (NER), Part-of-Speech (POS) tagging, sentiment analysis, dialect identification, and poetry classification. The fine-tuning code is available [here](https://github.com/CAMeL-Lab/CAMeLBERT).
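As a hedged illustration of that fine-tuning workflow (not the authors' released training setup), the sketch below loads the MSA checkpoint with a sequence-classification head, as one might do for sentiment analysis. The label count, example sentence, and learning rate are placeholders.

```python
# Minimal fine-tuning sketch (illustrative only): sentiment analysis with a
# sequence-classification head on top of CAMeLBERT-MSA. The number of labels
# and the toy batch are placeholders, not values from the paper.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = 'CAMeL-Lab/bert-base-arabic-camelbert-msa'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

# One toy training step; a real setup would iterate over a labeled dataset.
batch = tokenizer(["مرحبا يا عالم."], return_tensors='pt', padding=True)
labels = torch.tensor([1])
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

outputs = model(**batch, labels=labels)
outputs.loss.backward()
optimizer.step()
```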
How to Use
Basic Usage
```python
# Use the model with a pipeline for masked language modeling
from transformers import pipeline

unmasker = pipeline('fill-mask', model='CAMeL-Lab/bert-base-arabic-camelbert-msa')
unmasker("الهدف من الحياة هو [MASK] .")
```
Note: To download the models, you need `transformers>=3.5.0`. Otherwise, you can download them manually.
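If you do download the files manually, `from_pretrained` also accepts a local directory. A minimal sketch, where the local path is a placeholder for wherever you saved the model files:

```python
# Loading from a manually downloaded checkpoint (the local path is a placeholder).
from transformers import AutoTokenizer, AutoModel

local_dir = './bert-base-arabic-camelbert-msa'  # placeholder path
tokenizer = AutoTokenizer.from_pretrained(local_dir)
model = AutoModel.from_pretrained(local_dir)
```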
Advanced Usage in PyTorch
```python
# Get the features of a given text in PyTorch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-msa')
model = AutoModel.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-msa')

text = "مرحبا يا عالم."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
Advanced Usage in TensorFlow
```python
# Get the features of a given text in TensorFlow
from transformers import AutoTokenizer, TFAutoModel

tokenizer = AutoTokenizer.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-msa')
model = TFAutoModel.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-msa')

text = "مرحبا يا عالم."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```
✨ Features
- Multiple Variants: Pre-trained models for different Arabic variants (MSA, DA, CA, and a mix).
- Scaled-down Models: Additional models pre-trained on scaled-down MSA datasets.
- Versatile Use: Suitable for multiple NLP tasks after fine-tuning.
📦 Installation
To use the models, you need to have `transformers` installed. You can install it using the following command:

```bash
pip install "transformers>=3.5.0"
```
📚 Documentation
Training Data
- Modern Standard Arabic (MSA):
  - The Arabic Gigaword Fifth Edition
  - [Abu El-Khair Corpus](http://www.abuelkhair.net/index.php/en/arabic/abu-el-khair-corpus)
  - OSIAN corpus
  - [Arabic Wikipedia](https://archive.org/details/arwiki-20190201)
  - The unshuffled version of the Arabic [OSCAR corpus](https://oscar-corpus.com/)
Training Procedure
Preprocessing
- Extract the raw text from each corpus and apply the pre-processing steps below (a small runnable sketch follows this list).
- Remove invalid characters and normalize white spaces using utilities from [the original BERT implementation](https://github.com/google-research/bert/blob/eedf5716ce1268e56f0a50264a88cafad334ac61/tokenization.py#L286-L297).
- Remove lines without Arabic characters.
- Remove diacritics and kashida using [CAMeL Tools](https://github.com/CAMeL-Lab/camel_tools).
- Split each line into sentences using a heuristics-based sentence segmenter.
- Train a WordPiece tokenizer on the entire 167 GB text dataset with a vocabulary size of 30,000 using HuggingFace's tokenizers.
- Do not lowercase letters nor strip accents.
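The sketch below approximates the dediacritization, kashida removal, Arabic-line filtering, and WordPiece training described above. It is an illustration under stated assumptions, not the authors' actual scripts: the file paths are placeholders, and kashida (tatweel) removal is done here with a plain character replacement rather than a dedicated CAMeL Tools utility.

```python
# Approximate preprocessing sketch (not the authors' exact pipeline).
# Assumes camel_tools and tokenizers are installed; file paths are placeholders.
import re
from camel_tools.utils.dediac import dediac_ar
from tokenizers import BertWordPieceTokenizer

ARABIC_RE = re.compile(r'[\u0600-\u06FF]')  # keep only lines containing Arabic

def preprocess_line(line):
    line = dediac_ar(line)              # remove diacritics
    line = line.replace('\u0640', '')   # remove kashida (tatweel)
    return line.strip()

with open('raw_corpus.txt', encoding='utf-8') as fin, \
     open('clean_corpus.txt', 'w', encoding='utf-8') as fout:
    for line in fin:
        if ARABIC_RE.search(line):
            fout.write(preprocess_line(line) + '\n')

# Train a WordPiece tokenizer with a 30,000-token vocabulary,
# without lowercasing or accent stripping.
tokenizer = BertWordPieceTokenizer(lowercase=False, strip_accents=False)
tokenizer.train(files=['clean_corpus.txt'], vocab_size=30000)
tokenizer.save_model('.')
```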
Pre-training
- Train the model on a single cloud TPU (`v3-8`) for one million steps in total.
- Use a batch size of 1,024 for the first 90,000 steps and 256 for the remaining steps.
- Limit the sequence length to 128 tokens for 90% of the steps and 512 for the remaining 10%.
- Use whole word masking and a duplicate factor of 10.
- Set max predictions per sequence to 20 for the dataset with a max sequence length of 128 tokens and 80 for the dataset with a max sequence length of 512 tokens.
- Use a random seed of 12345, a masked language model probability of 0.15, and a short sequence probability of 0.1.
- Use the Adam optimizer with a learning rate of 1e-4, \(\beta_{1}=0.9\) and \(\beta_{2}=0.999\), a weight decay of 0.01, learning rate warm-up for 10,000 steps, and linear decay of the learning rate afterwards (a small sketch of this schedule follows the list).
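As a worked illustration of the schedule above, the helper below ramps the learning rate linearly from 0 to 1e-4 over the first 10,000 steps and then decays it linearly toward zero over the remaining steps. This is a common reading of BERT-style warm-up plus linear decay, not code from the CAMeLBERT repository, and the exact decay curve used in training may differ slightly.

```python
# BERT-style learning-rate schedule sketch: linear warm-up then linear decay.
# Illustrative only; constants mirror the settings listed above.
PEAK_LR = 1e-4
WARMUP_STEPS = 10_000
TOTAL_STEPS = 1_000_000

def learning_rate(step):
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS                       # linear warm-up
    remaining = TOTAL_STEPS - step
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)      # linear decay to 0

# Example values along the schedule
for s in (0, 5_000, 10_000, 505_000, 1_000_000):
    print(s, learning_rate(s))
```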
Evaluation Results
- Evaluate the pre-trained language models on five NLP tasks (NER, POS tagging, sentiment analysis, dialect identification, and poetry classification) using 12 datasets.
- Fine-tune and evaluate the models using Hugging Face's transformers (v3.1.0) along with PyTorch (v1.5.1).
- Add a fully connected linear layer to the last hidden state for fine-tuning (see the sketch after this list).
- Use the \(F_{1}\) score as the evaluation metric for all tasks.
- The fine-tuning code is available [here](https://github.com/CAMeL-Lab/CAMeLBERT).
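To make the fine-tuning setup above concrete, here is a hedged sketch of a classification head: a single linear layer applied to the last hidden state of the first ([CLS]) token, with the F1 score computed via scikit-learn. It illustrates the architecture described in the list rather than reproducing the released fine-tuning code; the label count, example sentences, and gold labels are placeholders.

```python
# Illustrative classification head: a linear layer over the [CLS] last hidden
# state, evaluated with F1. Not the released fine-tuning code; num_labels and
# the toy data are placeholders.
import torch
import torch.nn as nn
from sklearn.metrics import f1_score
from transformers import AutoTokenizer, AutoModel

model_name = 'CAMeL-Lab/bert-base-arabic-camelbert-msa'
tokenizer = AutoTokenizer.from_pretrained(model_name)
encoder = AutoModel.from_pretrained(model_name)
classifier = nn.Linear(encoder.config.hidden_size, 3)  # placeholder label count

texts = ["مرحبا يا عالم.", "الهدف من الحياة هو العمل."]
gold = [1, 0]  # placeholder labels

batch = tokenizer(texts, return_tensors='pt', padding=True)
with torch.no_grad():
    hidden = encoder(**batch).last_hidden_state   # (batch, seq_len, hidden)
    logits = classifier(hidden[:, 0, :])          # [CLS] representation
    preds = logits.argmax(dim=-1).tolist()

print(f1_score(gold, preds, average='macro'))
```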
Results
| Task | Dataset | Variant | Mix | CA | DA | MSA | MSA-1/2 | MSA-1/4 | MSA-1/8 | MSA-1/16 |
|---|---|---|---|---|---|---|---|---|---|---|
| NER | ANERcorp | MSA | 80.8% | 67.9% | 74.1% | 82.4% | 82.0% | 82.1% | 82.6% | 80.8% |
| POS | PATB (MSA) | MSA | 98.1% | 97.8% | 97.7% | 98.3% | 98.2% | 98.3% | 98.2% | 98.2% |
| | ARZTB (EGY) | DA | 93.6% | 92.3% | 92.7% | 93.6% | 93.6% | 93.7% | 93.6% | 93.6% |
| | Gumar (GLF) | DA | 97.3% | 97.7% | 97.9% | 97.9% | 97.9% | 97.9% | 97.9% | 97.9% |
| SA | ASTD | MSA | 76.3% | 69.4% | 74.6% | 76.9% | 76.0% | 76.8% | 76.7% | 75.3% |
| | ArSAS | MSA | 92.7% | 89.4% | 91.8% | 93.0% | 92.6% | 92.5% | 92.5% | 92.3% |
| | SemEval | MSA | 69.0% | 58.5% | 68.4% | 72.1% | 70.7% | 72.8% | 71.6% | 71.2% |
| DID | MADAR-26 | DA | 62.9% | 61.9% | 61.8% | 62.6% | 62.0% | 62.8% | 62.0% | 62.2% |
| | MADAR-6 | DA | 92.5% | 91.5% | 92.2% | 91.9% | 91.8% | 92.2% | 92.1% | 92.0% |
| | MADAR-Twitter-5 | MSA | 75.7% | 71.4% | 74.2% | 77.6% | 78.5% | 77.3% | 77.7% | 76.2% |
| | NADI | DA | 24.7% | 17.3% | 20.1% | 24.9% | 24.6% | 24.6% | 24.9% | 23.8% |
| Poetry | APCD | CA | 79.8% | 80.9% | 79.6% | 79.7% | 79.9% | 80.0% | 79.7% | 79.8% |
Results (Average)
| | Variant | Mix | CA | DA | MSA | MSA-1/2 | MSA-1/4 | MSA-1/8 | MSA-1/16 |
|---|---|---|---|---|---|---|---|---|---|
| Variant-wise-average[[1]](#footnote-1) | MSA | 82.1% | 75.7% | 80.1% | 83.4% | 83.0% | 83.3% | 83.2% | 82.3% |
| | DA | 74.4% | 72.1% | 72.9% | 74.2% | 74.0% | 74.3% | 74.1% | 73.9% |
| | CA | 79.8% | 80.9% | 79.6% | 79.7% | 79.9% | 80.0% | 79.7% | 79.8% |
| Macro-Average | ALL | 78.7% | 74.7% | 77.1% | 79.2% | 79.0% | 79.2% | 79.1% | 78.6% |

[1]: Variant-wise-average refers to the average over a group of tasks in the same language variant.
🔧 Technical Details
The pre-training process uses [the original implementation](https://github.com/google-research/bert) released by Google. It follows the hyperparameters of the original English BERT model, with specific settings for batch size, sequence length, masking, and optimization as described in the "Training Procedure" section.
📄 License
The project is licensed under the Apache-2.0 license.
Acknowledgements
This research was supported with Cloud TPUs from Google’s TensorFlow Research Cloud (TFRC).
Citation
@inproceedings{inoue-etal-2021-interplay,
title = "The Interplay of Variant, Size, and Task Type in {A}rabic Pre-trained Language Models",
author = "Inoue, Go and
Alhafni, Bashar and
Baimukan, Nurpeiis and
Bouamor, Houda and
Habash, Nizar",
booktitle = "Proceedings of the Sixth Arabic Natural Language Processing Workshop",
month = apr,
year = "2021",
address = "Kyiv, Ukraine (Online)",
publisher = "Association for Computational Linguistics",
abstract = "In this paper, we explore the effects of language variants, data sizes, and fine-tuning task types in Arabic pre-trained language models. To do so, we build three pre-trained language models across three variants of Arabic: Modern Standard Arabic (MSA), dialectal Arabic, and classical Arabic, in addition to a fourth language model which is pre-trained on a mix of the three. We also examine the importance of pre-training data size by building additional models that are pre-trained on a scaled-down set of the MSA variant. We compare our different models to each other, as well as to eight publicly available models by fine-tuning them on five NLP tasks spanning 12 datasets. Our results suggest that the variant proximity of pre-training data to fine-tuning data is more important than the pre-training data size. We exploit this insight in defining an optimized system selection model for the studied tasks.",
}

