Arabic-ALBERT Base
An Arabic edition of the ALBERT Base pretrained language model, designed to empower Arabic NLP tasks with state-of-the-art performance.
Quick Start
To use these models, you need to install `torch` or `tensorflow`, along with the Hugging Face `transformers` library. Then, you can initialize the model as follows:
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

base_tokenizer = AutoTokenizer.from_pretrained("kuisailab/albert-base-arabic")
base_model = AutoModelForMaskedLM.from_pretrained("kuisailab/albert-base-arabic")
```
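For a quick sanity check, the checkpoint can also be queried through the fill-mask pipeline. The snippet below is a minimal sketch; the Arabic sentence is only an illustrative placeholder.

```python
from transformers import pipeline

# Build a fill-mask pipeline on top of the Arabic ALBERT base checkpoint.
fill_mask = pipeline(
    "fill-mask",
    model="kuisailab/albert-base-arabic",
    tokenizer="kuisailab/albert-base-arabic",
)

# Predict the masked token in a placeholder Arabic sentence ("Arabic is a ... language").
print(fill_mask(f"اللغة العربية هي لغة {fill_mask.tokenizer.mask_token}"))
```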
Features
- Pretrained on a large Arabic corpus, including data from OSCAR and Wikipedia.
- Supports various NLP tasks due to its masked-language-model architecture.
- Multiple model sizes (base, large, xlarge) are available to suit different needs.
Installation
You need to install the following libraries to use these models (an example install command is shown below):
- `torch` or `tensorflow`
- the Hugging Face `transformers` library
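For example, a typical installation with a PyTorch backend (one of several possible setups) looks like this:

```bash
pip install torch transformers
```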
Usage Examples
Basic Usage
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

base_tokenizer = AutoTokenizer.from_pretrained("kuisailab/albert-base-arabic")
base_model = AutoModelForMaskedLM.from_pretrained("kuisailab/albert-base-arabic")
```
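To illustrate what the loaded objects return, a single forward pass can be run as sketched below; the Arabic sentence is only a placeholder.

```python
import torch

# Tokenize a placeholder Arabic sentence and run one forward pass without gradients.
inputs = base_tokenizer("مرحبا بالعالم", return_tensors="pt")
with torch.no_grad():
    outputs = base_model(**inputs)

# For a masked-LM head, `logits` has shape (batch_size, sequence_length, vocab_size).
print(outputs.logits.shape)
```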
Advanced Usage
```python
import torch
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, TrainingArguments, Trainer)

tokenizer = AutoTokenizer.from_pretrained("kuisailab/albert-base-arabic")
model = AutoModelForMaskedLM.from_pretrained("kuisailab/albert-base-arabic")

# Placeholder sentences; in practice you would fine-tune on Arabic text.
texts = ["This is an example sentence.", "Another example for demonstration."]
encodings = tokenizer(texts, padding=True, truncation=True)

# Trainer expects a dataset that yields one dict of tensors per example.
class TextDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings
    def __len__(self):
        return len(self.encodings["input_ids"])
    def __getitem__(self, idx):
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}

# Dynamically masks tokens so the masked-LM head has labels to train on.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=TextDataset(encodings),
    data_collator=data_collator,
)

trainer.train()
```
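After training, the fine-tuned weights can be saved for later reuse; the output directory name below is just an example.

```python
# Save the fine-tuned model and tokenizer to a local directory (path is illustrative).
trainer.save_model("./albert-base-arabic-finetuned")
tokenizer.save_pretrained("./albert-base-arabic-finetuned")
```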
Documentation
Pretraining data
The models were pretrained on approximately 4.4 billion words drawn from the Arabic OSCAR corpus and Arabic Wikipedia.
Notes on training data:
- Our final version of the corpus contains some non-Arabic words inline, which we did not remove from sentences, since removing them would affect tasks such as NER.
- Non-Arabic characters were lowercased as a preprocessing step; since Arabic script has no upper or lower case, there are no separate cased and uncased versions of the model.
- The corpus and vocabulary set are not restricted to Modern Standard Arabic; they also contain some dialectal Arabic.
Pretraining details
- These models were trained using Google's ALBERT GitHub repository on a single TPU v3-8, provided free of charge by TFRC.
- Our pretraining procedure follows the training settings of BERT with some changes: we trained for 7M steps with a batch size of 64, instead of 125K steps with a batch size of 4096 (see the quick comparison below).
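For scale, the total number of training examples processed under each schedule works out as follows:

```python
# Total examples seen under each schedule (steps * batch size).
reference_setting = 125_000 * 4096   # 512,000,000 examples
this_model        = 7_000_000 * 64   # 448,000,000 examples
print(reference_setting, this_model)
```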
Models
| Property | albert-base | albert-large | albert-xlarge |
|---|---|---|---|
| Hidden layers | 12 | 24 | 24 |
| Attention heads | 12 | 16 | 32 |
| Hidden size | 768 | 1024 | 2048 |
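The larger checkpoints can be loaded the same way; the sketch below assumes they follow the base model's naming pattern on the Hugging Face Hub, so verify the exact model IDs before use.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Assumed Hub IDs, following the base model's naming pattern; verify before use.
large_tokenizer = AutoTokenizer.from_pretrained("kuisailab/albert-large-arabic")
large_model = AutoModelForMaskedLM.from_pretrained("kuisailab/albert-large-arabic")
```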
Results
For further details on the models' performance or any other queries, please refer to [Arabic-ALBERT](https://github.com/KUIS-AI-Lab/Arabic-ALBERT/).
Citation
If you use any of these models in your work, please cite this work as:
```bibtex
@software{ali_safaya_2020_4718724,
  author    = {Ali Safaya},
  title     = {Arabic-ALBERT},
  month     = aug,
  year      = 2020,
  publisher = {Zenodo},
  version   = {1.0.0},
  doi       = {10.5281/zenodo.4718724},
  url       = {https://doi.org/10.5281/zenodo.4718724}
}
```
Acknowledgements
Thanks to Google for providing free TPUs for the training process, and to Hugging Face for hosting these models on their servers.