🚀 Arabic GPT2
A pre-trained Transformer for Arabic language generation, trained on a large Arabic dataset. You can find more information in our paper AraGPT2.

The code in this repository was used to train all GPT2 variants. It supports training and fine-tuning GPT2 on GPUs and TPUs via the TPUEstimator API.
GPT2-base and GPT2-medium use the code in the gpt2 folder and can train models from the minimaxir/gpt-2-simple repository. These models were trained with the lamb optimizer, follow the same architecture as GPT2, and are fully compatible with the transformers library.
GPT2-large and GPT2-mega were trained with the imcaspar/gpt2-ml library and follow the grover architecture. You can use the PyTorch classes in grover/modeling_gpt2.py as a direct replacement for the corresponding classes in the transformers library (they should be compatible with transformers v4.x). Both models were trained with the adafactor optimizer, since the adam and lamb optimizers use too much memory and the model would not fit even a single batch on a TPU core.
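As a minimal sketch (using aubmindlab/aragpt2-mega for illustration, and assuming the arabert package is installed), loading a Grover-based checkpoint with the drop-in class looks like this:

```python
from transformers import GPT2TokenizerFast

# Drop-in replacement for transformers' GPT2LMHeadModel, needed for the
# Grover-based large and mega checkpoints (requires the arabert package).
from arabert.aragpt2.grover.modeling_gpt2 import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("aubmindlab/aragpt2-mega")
tokenizer = GPT2TokenizerFast.from_pretrained("aubmindlab/aragpt2-mega")
```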
AraGPT2 is trained on the same large Arabic dataset as AraBERTv2.
🚀 Quick Start
📦 Installation
Installation mainly involves setting up the required libraries and dependencies. To use the models with the transformers library, install transformers (pip install transformers) together with the arabert package (pip install arabert), which provides the preprocessing utilities used in the examples below. To use our custom training code, you need TensorFlow 1.15.4.
💻 Usage Examples
Basic Usage
Testing the model using the transformers library:
from transformers import GPT2TokenizerFast, pipeline

# For the base and medium models:
from transformers import GPT2LMHeadModel
# For the large and mega models, use the Grover-based class instead
# (requires the arabert package: pip install arabert):
# from arabert.aragpt2.grover.modeling_gpt2 import GPT2LMHeadModel

from arabert.preprocess import ArabertPreprocessor

MODEL_NAME = "aubmindlab/aragpt2-base"
arabert_prep = ArabertPreprocessor(model_name=MODEL_NAME)

text = ""  # put your Arabic prompt here
text_clean = arabert_prep.preprocess(text)

model = GPT2LMHeadModel.from_pretrained(MODEL_NAME)
tokenizer = GPT2TokenizerFast.from_pretrained(MODEL_NAME)
generation_pipeline = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Feel free to adjust the decoding settings
generation_pipeline(
    text_clean,
    pad_token_id=tokenizer.eos_token_id,
    num_beams=10,
    max_length=200,
    top_p=0.9,
    repetition_penalty=3.0,
    no_repeat_ngram_size=3,
)[0]["generated_text"]
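If you prefer not to go through the pipeline API, the same decoding settings can be passed to model.generate directly. This is a minimal sketch that reuses model, tokenizer, and text_clean from the snippet above and assumes text_clean holds a non-empty Arabic prompt:

```python
import torch

# Tokenize the preprocessed prompt and generate with the same decoding settings.
inputs = tokenizer(text_clean, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_length=200,
        num_beams=10,
        top_p=0.9,
        repetition_penalty=3.0,
        no_repeat_ngram_size=3,
        pad_token_id=tokenizer.eos_token_id,
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```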
Advanced Usage
Generate the training TFRecords:

python create_pretraining_data.py \
 --input_file=<RAW TEXT FILE with documents/articles separated by an empty line> \
 --output_file=<OUTPUT TFRecord> \
 --tokenizer_dir=<Directory with the GPT2 Tokenizer files>
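The raw input file is expected to contain one document or article per block, separated by an empty line. A hypothetical helper (not part of the repository) for producing that format could look like:

```python
# Hypothetical helper (not part of this repo): write raw Arabic documents in the
# empty-line-separated format expected by create_pretraining_data.py.
docs = [
    "نص المقال الأول ...",
    "نص المقال الثاني ...",
]
with open("raw_corpus.txt", "w", encoding="utf-8") as f:
    f.write("\n\n".join(docs) + "\n")
```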
Finetuning:
python3 run_pretraining.py \
 --input_file="gs://<GS_BUCKET>/pretraining_data/*" \
 --output_dir="gs://<GS_BUCKET>/pretraining_model/" \
 --config_file="config/small_hparams.json" \
 --batch_size=128 \
 --eval_batch_size=8 \
 --num_train_steps= \
 --num_warmup_steps= \
 --learning_rate= \
 --save_checkpoints_steps= \
 --max_seq_length=1024 \
 --max_eval_steps= \
 --optimizer="lamb" \
 --iterations_per_loop=5000 \
 --keep_checkpoint_max=10 \
 --use_tpu=True \
 --tpu_name=<TPU NAME> \
 --do_train=True \
 --do_eval=False
📚 Documentation
Model Sizes
| Property | Details |
|---|---|
| Model | AraGPT2-base, AraGPT2-medium, AraGPT2-large, AraGPT2-mega |
| Optimizer | lamb for base and medium; adafactor for large and mega |
| Context size | 1024 |
| Embedding size | 768 (base), 1024 (medium), 1280 (large), 1536 (mega) |
| Number of heads | 12 (base), 16 (medium), 20 (large), 25 (mega) |
| Number of layers | 12 (base), 24 (medium), 36 (large), 48 (mega) |
| Model size / number of params | 527MB / 135M (base), 1.38GB / 370M (medium), 2.98GB / 792M (large), 5.5GB / 1.46B (mega) |
All models are available on the HuggingFace model page under the aubmindlab name. Checkpoints are available in PyTorch, TF2, and TF1 formats.
Compute
| Property | Details |
|---|---|
| Model | AraGPT2-base, AraGPT2-medium, AraGPT2-large, AraGPT2-mega |
| Hardware | TPUv3-128 (base, large, mega); TPUv3-8 (medium) |
| Num. of examples (seq len = 1024) | 9.7M |
| Batch size | 1792 (base), 1152 (medium), 256 (large, mega) |
| Num. of steps | 125K (base), 85K (medium), 220K (large), 780K (mega) |
| Training time (days) | 1.5 (base, medium), 3 (large), 9 (mega) |
Dataset
The pretraining data used for the new AraGPT2 model is also used for AraBERTv2 and AraELECTRA.
The dataset consists of 77GB of text: 200,095,961 lines, 8,655,948,860 words, and 82,232,988,358 characters (before applying Farasa segmentation).

For the new dataset, we added the unshuffled OSCAR corpus, after thoroughly filtering it, to the dataset used in AraBERTv1, but without the websites that we previously crawled.
Disclaimer
The text generated by AraGPT2 is produced automatically by a neural network model trained on a large amount of text and does not represent the official views or preferences of the authors or their institutions. Text generated by AraGPT2 should be used only for research and scientific purposes. If it infringes on your rights or interests or violates social morality, please do not propagate it.
Citation
If you use this model, please cite us as:
@inproceedings{antoun-etal-2021-aragpt2,
title = "{A}ra{GPT}2: Pre-Trained Transformer for {A}rabic Language Generation",
author = "Antoun, Wissam and
Baly, Fady and
Hajj, Hazem",
booktitle = "Proceedings of the Sixth Arabic Natural Language Processing Workshop",
month = apr,
year = "2021",
address = "Kyiv, Ukraine (Virtual)",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2021.wanlp-1.21",
pages = "196--207",
}
Acknowledgments
Thanks to TensorFlow Research Cloud (TFRC) for free access to Cloud TPUs; we couldn't have done it without this program. Thanks also to the AUB MIND Lab members for their continuous support, to Yakshof and Assafir for data and storage access, and to Habib Rahal (https://www.behance.net/rahalhabib) for putting a face to AraBERT.
Contacts