Arabic-ALBERT Base
An Arabic edition of the ALBERT Base pretrained language model, designed to empower Arabic NLP tasks with state-of-the-art performance.
Quick Start
To use these models, you need to install `torch` or `tensorflow`, along with the Hugging Face `transformers` library. Then, you can initialize the model as follows:
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

base_tokenizer = AutoTokenizer.from_pretrained("kuisailab/albert-base-arabic")
base_model = AutoModelForMaskedLM.from_pretrained("kuisailab/albert-base-arabic")
```
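For a quick sanity check, the checkpoint can also be queried through the fill-mask pipeline. The snippet below is a minimal sketch; the Arabic sentence is only an illustrative placeholder.

```python
from transformers import pipeline

# Build a fill-mask pipeline on top of the Arabic ALBERT base checkpoint.
fill_mask = pipeline(
    "fill-mask",
    model="kuisailab/albert-base-arabic",
    tokenizer="kuisailab/albert-base-arabic",
)

# Predict the masked token in a placeholder Arabic sentence ("Arabic is a ... language").
print(fill_mask(f"اللغة العربية هي لغة {fill_mask.tokenizer.mask_token}"))
```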
Features
- Pretrained on a large Arabic corpus, including data from OSCAR and Wikipedia.
- Supports various NLP tasks due to its masked-language-model architecture.
- Multiple model sizes (base, large, xlarge) are available to suit different needs.
Installation
You need to install the following libraries to use these models (an example install command is shown below):
- `torch` or `tensorflow`
- the Hugging Face `transformers` library
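For example, a typical installation with a PyTorch backend (one of several possible setups) looks like this:

```bash
pip install torch transformers
```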
Usage Examples
Basic Usage
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

base_tokenizer = AutoTokenizer.from_pretrained("kuisailab/albert-base-arabic")
base_model = AutoModelForMaskedLM.from_pretrained("kuisailab/albert-base-arabic")
```
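To illustrate what the loaded objects return, a single forward pass can be run as sketched below; the Arabic sentence is only a placeholder.

```python
import torch

# Tokenize a placeholder Arabic sentence and run one forward pass without gradients.
inputs = base_tokenizer("مرحبا بالعالم", return_tensors="pt")
with torch.no_grad():
    outputs = base_model(**inputs)

# For a masked-LM head, `logits` has shape (batch_size, sequence_length, vocab_size).
print(outputs.logits.shape)
```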
Advanced Usage
```python
import torch
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, TrainingArguments, Trainer)

tokenizer = AutoTokenizer.from_pretrained("kuisailab/albert-base-arabic")
model = AutoModelForMaskedLM.from_pretrained("kuisailab/albert-base-arabic")

# Placeholder sentences; in practice you would fine-tune on Arabic text.
texts = ["This is an example sentence.", "Another example for demonstration."]
encodings = tokenizer(texts, padding=True, truncation=True)

# Trainer expects a dataset that yields one dict of tensors per example.
class TextDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings
    def __len__(self):
        return len(self.encodings["input_ids"])
    def __getitem__(self, idx):
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}

# Dynamically masks tokens so the masked-LM head has labels to train on.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=TextDataset(encodings),
    data_collator=data_collator,
)

trainer.train()
```
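After training, the fine-tuned weights can be saved for later reuse; the output directory name below is just an example.

```python
# Save the fine-tuned model and tokenizer to a local directory (path is illustrative).
trainer.save_model("./albert-base-arabic-finetuned")
tokenizer.save_pretrained("./albert-base-arabic-finetuned")
```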
Documentation
Pretraining data
The models were pretrained on approximately 4.4 billion words drawn from the Arabic OSCAR corpus and Arabic Wikipedia.
Notes on training data:
- Our final version of the corpus contains some non-Arabic words inline, which we did not remove from sentences, since removing them would affect tasks such as NER.
- Non-Arabic characters were lowercased as a preprocessing step; since Arabic script has no upper or lower case, there are no separate cased and uncased versions of the model.
- The corpus and vocabulary set are not restricted to Modern Standard Arabic; they also contain some dialectal Arabic.
Pretraining details
- These models were trained using Google's ALBERT GitHub repository on a single TPU v3-8, provided free of charge by TFRC.
- Our pretraining procedure follows the training settings of BERT with some changes: we trained for 7M steps with a batch size of 64, instead of 125K steps with a batch size of 4096 (see the quick comparison below).
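For scale, the total number of training examples processed under each schedule works out as follows:

```python
# Total examples seen under each schedule (steps * batch size).
reference_setting = 125_000 * 4096   # 512,000,000 examples
this_model        = 7_000_000 * 64   # 448,000,000 examples
print(reference_setting, this_model)
```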
Models
| Property | albert-base | albert-large | albert-xlarge |
|---|---|---|---|
| Hidden layers | 12 | 24 | 24 |
| Attention heads | 12 | 16 | 32 |
| Hidden size | 768 | 1024 | 2048 |
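The larger checkpoints can be loaded the same way; the sketch below assumes they follow the base model's naming pattern on the Hugging Face Hub, so verify the exact model IDs before use.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Assumed Hub IDs, following the base model's naming pattern; verify before use.
large_tokenizer = AutoTokenizer.from_pretrained("kuisailab/albert-large-arabic")
large_model = AutoModelForMaskedLM.from_pretrained("kuisailab/albert-large-arabic")
```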
Results
For further details on the models' performance or any other queries, please refer to [Arabic-ALBERT](https://github.com/KUIS-AI-Lab/Arabic-ALBERT/).
Citation
If you use any of these models in your work, please cite this work as:
```bibtex
@software{ali_safaya_2020_4718724,
  author    = {Ali Safaya},
  title     = {Arabic-ALBERT},
  month     = aug,
  year      = 2020,
  publisher = {Zenodo},
  version   = {1.0.0},
  doi       = {10.5281/zenodo.4718724},
  url       = {https://doi.org/10.5281/zenodo.4718724}
}
```
Acknowledgements
Thanks to Google for providing free TPUs for the training process, and to Hugging Face for hosting these models on their servers.