Arabic GPT2
Arabic GPT2 (AraGPT2) is a pre-trained model for Arabic language generation. It comes in multiple variants and is trained on a large Arabic dataset.
You can find more information in our paper AraGPT2.
The code in this repository was used to train all GPT2 variants, and it supports training and fine-tuning GPT2 on GPUs and TPUs via the TPUEstimator API.
GPT2-base and GPT2-medium use the code in the gpt2 folder and can train models based on the [minimaxir/gpt-2-simple](https://github.com/minimaxir/gpt-2-simple) repository. These models were trained with the lamb optimizer, follow the same architecture as gpt2, and are fully compatible with the transformers library.
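As a quick illustration of that compatibility, here is a minimal sketch using the stock transformers classes. The aubmindlab/aragpt2-base checkpoint name follows the naming on the model page, and the prompt and decoding settings are only illustrative, not recommended values:

```python
# Minimal sketch: AraGPT2-base/medium load with the standard GPT2 classes.
# Checkpoint name and decoding settings are illustrative; real inputs should be
# preprocessed with ArabertPreprocessor (see the Important Note below).
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("aubmindlab/aragpt2-base")
tokenizer = GPT2TokenizerFast.from_pretrained("aubmindlab/aragpt2-base")

inputs = tokenizer("يحكى أن", return_tensors="pt")  # short Arabic prompt ("it is said that")
outputs = model.generate(**inputs,
                         max_length=100,
                         do_sample=True,
                         top_p=0.9,
                         pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```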
GPT2-large and GPT2-mega were trained using the [imcaspar/gpt2-ml](https://github.com/imcaspar/gpt2-ml/) library and follow the grover architecture. You can use the PyTorch classes found in grover/modeling_gpt2.py as a direct replacement for the corresponding classes in the transformers library (it should support version v4.x of transformers).
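A hedged sketch of that swap, assuming the repository is cloned and on your PYTHONPATH (the exact import path is an assumption and depends on how you install the code):

```python
# Sketch: using the Grover-based GPT2 classes for AraGPT2-large/mega.
# The import path mirrors grover/modeling_gpt2.py in this repository and is an
# assumption; adjust it to match how the code is installed on your machine.
from grover.modeling_gpt2 import GPT2LMHeadModel  # drop-in for the transformers class
from transformers import GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("aubmindlab/aragpt2-mega")
tokenizer = GPT2TokenizerFast.from_pretrained("aubmindlab/aragpt2-mega")
```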
Both models were trained using the adafactor optimizer, since the adam and lamb optimizers use too much memory, causing the model to not fit even a single batch on a TPU core.
AraGPT2 is trained on the same large Arabic Dataset as AraBERTv2.
License
The model is under a custom license. You can find more details [here](https://github.com/aub-mind/arabert/blob/master/aragpt2/LICENSE).
Quick Start
Testing the model using transformers
The model code is now hosted on HuggingFace, so you need to pass the trust_remote_code flag. The model can then be used as follows:
```python
from transformers import AutoModelForCausalLM, GPT2TokenizerFast, pipeline
from arabert.preprocess import ArabertPreprocessor

MODEL_NAME = "aubmindlab/aragpt2-mega"
arabert_prep = ArabertPreprocessor(model_name=MODEL_NAME)

text = ""  # insert your raw Arabic prompt here
text_clean = arabert_prep.preprocess(text)

model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, trust_remote_code=True)
tokenizer = GPT2TokenizerFast.from_pretrained(MODEL_NAME)
generation_pipeline = pipeline(
    "text-generation", model=model, tokenizer=tokenizer
)

# feel free to try different decoding settings
generation_pipeline(text_clean,
                    pad_token_id=tokenizer.eos_token_id,
                    num_beams=10,
                    max_length=200,
                    top_p=0.9,
                    repetition_penalty=3.0,
                    no_repeat_ngram_size=3)[0]['generated_text']
```
Fine-tuning using transformers
Follow the guide linked [here](https://towardsdatascience.com/fine-tuning-gpt2-on-colab-gpu-for-free-340468c92ed).
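Alternatively, if you want to stay inside transformers, the sketch below outlines a standard Trainer-based fine-tuning loop. It is only a sketch: the training file name, sequence length, and hyperparameters are placeholders, not values we used.

```python
# Hedged sketch of fine-tuning AraGPT2-base with the transformers Trainer API.
# File names and hyperparameters below are illustrative placeholders.
from transformers import (AutoModelForCausalLM, GPT2TokenizerFast,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)
from datasets import load_dataset

model_name = "aubmindlab/aragpt2-base"
tokenizer = GPT2TokenizerFast.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT2 tokenizers have no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# One document per line; remember to preprocess the text with ArabertPreprocessor first.
dataset = load_dataset("text", data_files={"train": "train.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="aragpt2-finetuned",
                           per_device_train_batch_size=2,
                           num_train_epochs=1),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```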
Fine-tuning using our code with TF 1.15.4
Create the Training TFRecords
```bash
python create_pretraining_data.py \
 --input_file=<RAW TEXT FILE with documents/article separated by an empty line> \
 --output_file=<OUTPUT TFRecord> \
 --tokenizer_dir=<Directory with the GPT2 Tokenizer files>
```
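The raw text file passed to --input_file should contain one document or article per block, with an empty line between documents. A hedged sketch of preparing such a file, using the same arabert preprocessing shown above (file names and the documents list are placeholders):

```python
# Hedged sketch: writing the raw text file expected by create_pretraining_data.py,
# with one document per block and an empty line between documents.
# File names, the model name, and the corpus source are placeholders.
from arabert.preprocess import ArabertPreprocessor

arabert_prep = ArabertPreprocessor(model_name="aubmindlab/aragpt2-base")

documents = ["...first article...", "...second article..."]  # your raw corpus

with open("raw_corpus.txt", "w", encoding="utf-8") as f:
    for doc in documents:
        f.write(arabert_prep.preprocess(doc))
        f.write("\n\n")  # the empty line separates documents/articles
```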
Fine-tuning
```bash
python3 run_pretraining.py \
 --input_file="gs://<GS_BUCKET>/pretraining_data/*" \
 --output_dir="gs://<GS_BUCKET>/pretraining_model/" \
 --config_file="config/small_hparams.json" \
 --batch_size=128 \
 --eval_batch_size=8 \
 --num_train_steps= \
 --num_warmup_steps= \
 --learning_rate= \
 --save_checkpoints_steps= \
 --max_seq_length=1024 \
 --max_eval_steps= \
 --optimizer="lamb" \
 --iterations_per_loop=5000 \
 --keep_checkpoint_max=10 \
 --use_tpu=True \
 --tpu_name=<TPU NAME> \
 --do_train=True \
 --do_eval=False
```
Documentation
Model Sizes
| Model | Optimizer | Context size | Embedding Size | Num of heads | Num of layers | Model Size / Num of Params |
|:---|:---|:---|:---|:---|:---|:---|
| AraGPT2-base | lamb | 1024 | 768 | 12 | 12 | 527MB / 135M |
| AraGPT2-medium | lamb | 1024 | 1024 | 16 | 24 | 1.38GB / 370M |
| AraGPT2-large | adafactor | 1024 | 1280 | 20 | 36 | 2.98GB / 792M |
| AraGPT2-mega | adafactor | 1024 | 1536 | 25 | 48 | 5.5GB / 1.46B |
All models are available on the HuggingFace model page under the aubmindlab name. Checkpoints are available in PyTorch, TF2 and TF1 formats.
Compute
| Model | Hardware | Num of examples (seq len = 1024) | Batch Size | Num of Steps | Time (in days) |
|:---|:---|:---|:---|:---|:---|
| AraGPT2-base | TPUv3-128 | 9.7M | 1792 | 125K | 1.5 |
| AraGPT2-medium | TPUv3-8 | 9.7M | 1152 | 85K | 1.5 |
| AraGPT2-large | TPUv3-128 | 9.7M | 256 | 220K | 3 |
| AraGPT2-mega | TPUv3-128 | 9.7M | 256 | 780K | 9 |
Dataset
The pretraining data used for the new AraBERT model is also used for AraGPT2 and AraELECTRA.
The dataset consists of 77GB, or 200,095,961 lines, or 8,655,948,860 words, or 82,232,988,358 characters (before applying Farasa segmentation).
For the new dataset, we added the unshuffled OSCAR corpus (after thoroughly filtering it) to the dataset used in AraBERTv1, but without the websites that we previously crawled:
- OSCAR unshuffled and filtered
- [Arabic Wikipedia dump](https://archive.org/details/arwiki-20190201) from 2020/09/01
- [The 1.5B words Arabic Corpus](https://www.semanticscholar.org/paper/1.5-billion-words-Arabic-Corpus-El-Khair/f3eeef4afb81223df96575adadf808fe7fe440b4)
- [The OSIAN Corpus](https://www.aclweb.org/anthology/W19-4619)
- Assafir news articles. Huge thank you to Assafir for giving us the data
⚠️ Important Note
The model expects the input to be preprocessed using the arabert library; otherwise, it will not be able to generate correct output.
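For example, a minimal sketch of that preprocessing step (the model name and the raw string are only illustrations):

```python
# Minimal sketch of the required preprocessing step before generation.
from arabert.preprocess import ArabertPreprocessor

arabert_prep = ArabertPreprocessor(model_name="aubmindlab/aragpt2-mega")
text_clean = arabert_prep.preprocess("النص العربي الخام هنا")  # "the raw Arabic text goes here"
```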
Citation
If you use this model, please cite us as:
```bibtex
@inproceedings{antoun-etal-2021-aragpt2,
    title = "{A}ra{GPT}2: Pre-Trained Transformer for {A}rabic Language Generation",
    author = "Antoun, Wissam  and
      Baly, Fady  and
      Hajj, Hazem",
    booktitle = "Proceedings of the Sixth Arabic Natural Language Processing Workshop",
    month = apr,
    year = "2021",
    address = "Kyiv, Ukraine (Virtual)",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.wanlp-1.21",
    pages = "196--207",
}
```
Acknowledgments
Thanks to TensorFlow Research Cloud (TFRC) for the free access to Cloud TPUs; we couldn't have done it without this program. Thanks also to the AUB MIND Lab members for their continuous support, and to Yakshof and Assafir for data and storage access. Another thank you goes to Habib Rahal (https://www.behance.net/rahalhabib) for putting a face to AraBERT.
Contacts