🚀 Arabic GPT2
A pre-trained Transformer for Arabic language generation, trained on a large Arabic dataset. You can find more information in our paper AraGPT2.

The code in this repository was used to train all GPT2 variants. It supports training and fine-tuning GPT2 on GPUs and TPUs via the TPUEstimator API.
GPT2-base and GPT2-medium use the code in the gpt2 folder and can train models from the minimaxir/gpt-2-simple repository. These models were trained with the lamb optimizer, follow the same architecture as GPT2, and are fully compatible with the transformers library.
GPT2-large and GPT2-mega were trained with the imcaspar/gpt2-ml library and follow the grover architecture. You can use the PyTorch classes in grover/modeling_gpt2.py as a direct replacement for the corresponding classes in the transformers library (they should be compatible with transformers v4.x). Both models were trained with the adafactor optimizer, since the adam and lamb optimizers use too much memory and the model would not fit even a single batch on a TPU core.
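As a minimal sketch (using aubmindlab/aragpt2-mega for illustration, and assuming the arabert package is installed), loading a Grover-based checkpoint with the drop-in class looks like this:

```python
from transformers import GPT2TokenizerFast

# Drop-in replacement for transformers' GPT2LMHeadModel, needed for the
# Grover-based large and mega checkpoints (requires the arabert package).
from arabert.aragpt2.grover.modeling_gpt2 import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("aubmindlab/aragpt2-mega")
tokenizer = GPT2TokenizerFast.from_pretrained("aubmindlab/aragpt2-mega")
```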
AraGPT2 is trained on the same large Arabic dataset as AraBERTv2.
🚀 Quick Start
📦 Installation
Installation mainly involves setting up the required libraries and dependencies. To use the models with the transformers library, install transformers (pip install transformers) together with the arabert package (pip install arabert), which provides the preprocessing utilities used in the examples below. To use our custom training code, you need TensorFlow 1.15.4.
💻 Usage Examples
Basic Usage
Testing the model using the transformers library:
from transformers import GPT2TokenizerFast, pipeline

# For the base and medium models:
from transformers import GPT2LMHeadModel
# For the large and mega models, use the Grover-based class instead
# (requires the arabert package: pip install arabert):
# from arabert.aragpt2.grover.modeling_gpt2 import GPT2LMHeadModel

from arabert.preprocess import ArabertPreprocessor

MODEL_NAME = "aubmindlab/aragpt2-base"
arabert_prep = ArabertPreprocessor(model_name=MODEL_NAME)

text = ""  # put your Arabic prompt here
text_clean = arabert_prep.preprocess(text)

model = GPT2LMHeadModel.from_pretrained(MODEL_NAME)
tokenizer = GPT2TokenizerFast.from_pretrained(MODEL_NAME)
generation_pipeline = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Feel free to adjust the decoding settings
generation_pipeline(
    text_clean,
    pad_token_id=tokenizer.eos_token_id,
    num_beams=10,
    max_length=200,
    top_p=0.9,
    repetition_penalty=3.0,
    no_repeat_ngram_size=3,
)[0]["generated_text"]
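If you prefer not to go through the pipeline API, the same decoding settings can be passed to model.generate directly. This is a minimal sketch that reuses model, tokenizer, and text_clean from the snippet above and assumes text_clean holds a non-empty Arabic prompt:

```python
import torch

# Tokenize the preprocessed prompt and generate with the same decoding settings.
inputs = tokenizer(text_clean, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_length=200,
        num_beams=10,
        top_p=0.9,
        repetition_penalty=3.0,
        no_repeat_ngram_size=3,
        pad_token_id=tokenizer.eos_token_id,
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```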
Advanced Usage
Generate the training TFRecords:

python create_pretraining_data.py \
 --input_file=<RAW TEXT FILE with documents/articles separated by an empty line> \
 --output_file=<OUTPUT TFRecord> \
 --tokenizer_dir=<Directory with the GPT2 Tokenizer files>
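The raw input file is expected to contain one document or article per block, separated by an empty line. A hypothetical helper (not part of the repository) for producing that format could look like:

```python
# Hypothetical helper (not part of this repo): write raw Arabic documents in the
# empty-line-separated format expected by create_pretraining_data.py.
docs = [
    "نص المقال الأول ...",
    "نص المقال الثاني ...",
]
with open("raw_corpus.txt", "w", encoding="utf-8") as f:
    f.write("\n\n".join(docs) + "\n")
```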
Finetuning:
python3 run_pretraining.py \
 --input_file="gs://<GS_BUCKET>/pretraining_data/*" \
 --output_dir="gs://<GS_BUCKET>/pretraining_model/" \
 --config_file="config/small_hparams.json" \
 --batch_size=128 \
 --eval_batch_size=8 \
 --num_train_steps= \
 --num_warmup_steps= \
 --learning_rate= \
 --save_checkpoints_steps= \
 --max_seq_length=1024 \
 --max_eval_steps= \
 --optimizer="lamb" \
 --iterations_per_loop=5000 \
 --keep_checkpoint_max=10 \
 --use_tpu=True \
 --tpu_name=<TPU NAME> \
 --do_train=True \
 --do_eval=False
📚 Documentation
Model Sizes
| Property | Details |
|---|---|
| Model | AraGPT2-base, AraGPT2-medium, AraGPT2-large, AraGPT2-mega |
| Optimizer | lamb for base and medium; adafactor for large and mega |
| Context size | 1024 |
| Embedding size | 768 (base), 1024 (medium), 1280 (large), 1536 (mega) |
| Number of heads | 12 (base), 16 (medium), 20 (large), 25 (mega) |
| Number of layers | 12 (base), 24 (medium), 36 (large), 48 (mega) |
| Model size / number of params | 527MB / 135M (base), 1.38GB / 370M (medium), 2.98GB / 792M (large), 5.5GB / 1.46B (mega) |
All models are available on the HuggingFace model page under the aubmindlab name. Checkpoints are available in PyTorch, TF2, and TF1 formats.
Compute
| Property | Details |
|---|---|
| Model | AraGPT2-base, AraGPT2-medium, AraGPT2-large, AraGPT2-mega |
| Hardware | TPUv3-128 (base, large, mega); TPUv3-8 (medium) |
| Num. of examples (seq len = 1024) | 9.7M |
| Batch size | 1792 (base), 1152 (medium), 256 (large, mega) |
| Num. of steps | 125K (base), 85K (medium), 220K (large), 780K (mega) |
| Training time (days) | 1.5 (base, medium), 3 (large), 9 (mega) |
Dataset
The pretraining data used for the new AraGPT2 model is also used for AraBERTv2 and AraELECTRA.
The dataset consists of 77GB of text: 200,095,961 lines, 8,655,948,860 words, and 82,232,988,358 characters (before applying Farasa segmentation).

For the new dataset, we added the unshuffled OSCAR corpus, after thoroughly filtering it, to the dataset used in AraBERTv1, but without the websites that we previously crawled.
Disclaimer
The text generated by AraGPT2 is produced automatically by a neural network model trained on a large amount of text and does not represent the official views or preferences of the authors or their institutions. Text generated by AraGPT2 should be used only for research and scientific purposes. If it infringes on your rights or interests or violates social morality, please do not propagate it.
Citation
If you use this model, please cite us as:
@inproceedings{antoun-etal-2021-aragpt2,
title = "{A}ra{GPT}2: Pre-Trained Transformer for {A}rabic Language Generation",
author = "Antoun, Wissam and
Baly, Fady and
Hajj, Hazem",
booktitle = "Proceedings of the Sixth Arabic Natural Language Processing Workshop",
month = apr,
year = "2021",
address = "Kyiv, Ukraine (Virtual)",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2021.wanlp-1.21",
pages = "196--207",
}
Acknowledgments
Thanks to TensorFlow Research Cloud (TFRC) for free access to Cloud TPUs; we couldn't have done it without this program. Thanks also to the AUB MIND Lab members for their continuous support, to Yakshof and Assafir for data and storage access, and to Habib Rahal (https://www.behance.net/rahalhabib) for putting a face to AraBERT.
Contacts