Arabic GPT2
Arabic GPT2 (AraGPT2) is a pre-trained model for Arabic language generation. It comes in multiple variants and is trained on a large Arabic dataset.
You can find more information in our paper AraGPT2.
The code in this repository was used to train all GPT2 variants, and it supports training and fine-tuning GPT2 on GPUs and TPUs via the TPUEstimator API.
GPT2-base and GPT2-medium use the code in the gpt2 folder and can train models based on the [minimaxir/gpt-2-simple](https://github.com/minimaxir/gpt-2-simple) repository. These models were trained with the lamb optimizer, follow the same architecture as gpt2, and are fully compatible with the transformers library.
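As a quick illustration of that compatibility, here is a minimal sketch using the stock transformers classes. The aubmindlab/aragpt2-base checkpoint name follows the naming on the model page, and the prompt and decoding settings are only illustrative, not recommended values:

```python
# Minimal sketch: AraGPT2-base/medium load with the standard GPT2 classes.
# Checkpoint name and decoding settings are illustrative; real inputs should be
# preprocessed with ArabertPreprocessor (see the Important Note below).
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("aubmindlab/aragpt2-base")
tokenizer = GPT2TokenizerFast.from_pretrained("aubmindlab/aragpt2-base")

inputs = tokenizer("يحكى أن", return_tensors="pt")  # short Arabic prompt ("it is said that")
outputs = model.generate(**inputs,
                         max_length=100,
                         do_sample=True,
                         top_p=0.9,
                         pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```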
GPT2-large and GPT2-mega were trained using the [imcaspar/gpt2-ml](https://github.com/imcaspar/gpt2-ml/) library and follow the grover architecture. You can use the PyTorch classes found in grover/modeling_gpt2.py as a direct replacement for the corresponding classes in the transformers library (it should support version v4.x of transformers).
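A hedged sketch of that swap, assuming the repository is cloned and on your PYTHONPATH (the exact import path is an assumption and depends on how you install the code):

```python
# Sketch: using the Grover-based GPT2 classes for AraGPT2-large/mega.
# The import path mirrors grover/modeling_gpt2.py in this repository and is an
# assumption; adjust it to match how the code is installed on your machine.
from grover.modeling_gpt2 import GPT2LMHeadModel  # drop-in for the transformers class
from transformers import GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("aubmindlab/aragpt2-mega")
tokenizer = GPT2TokenizerFast.from_pretrained("aubmindlab/aragpt2-mega")
```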
Both models were trained using the adafactor optimizer, since the adam and lamb optimizers use too much memory, causing the model to not fit even a single batch on a TPU core.
AraGPT2 is trained on the same large Arabic Dataset as AraBERTv2.
License
The model is under a custom license. You can find more details [here](https://github.com/aub-mind/arabert/blob/master/aragpt2/LICENSE).
Quick Start
Testing the model using transformers
The model code is now hosted on HuggingFace, so you need to pass the trust_remote_code flag. The model can then be used as follows:
```python
from transformers import AutoModelForCausalLM, GPT2TokenizerFast, pipeline
from arabert.preprocess import ArabertPreprocessor

MODEL_NAME = "aubmindlab/aragpt2-mega"
arabert_prep = ArabertPreprocessor(model_name=MODEL_NAME)

text = ""  # insert your raw Arabic prompt here
text_clean = arabert_prep.preprocess(text)

model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, trust_remote_code=True)
tokenizer = GPT2TokenizerFast.from_pretrained(MODEL_NAME)
generation_pipeline = pipeline(
    "text-generation", model=model, tokenizer=tokenizer
)

# feel free to try different decoding settings
generation_pipeline(text_clean,
                    pad_token_id=tokenizer.eos_token_id,
                    num_beams=10,
                    max_length=200,
                    top_p=0.9,
                    repetition_penalty=3.0,
                    no_repeat_ngram_size=3)[0]['generated_text']
```
Fine-tuning using transformers
Follow the guide linked [here](https://towardsdatascience.com/fine-tuning-gpt2-on-colab-gpu-for-free-340468c92ed).
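Alternatively, if you want to stay inside transformers, the sketch below outlines a standard Trainer-based fine-tuning loop. It is only a sketch: the training file name, sequence length, and hyperparameters are placeholders, not values we used.

```python
# Hedged sketch of fine-tuning AraGPT2-base with the transformers Trainer API.
# File names and hyperparameters below are illustrative placeholders.
from transformers import (AutoModelForCausalLM, GPT2TokenizerFast,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)
from datasets import load_dataset

model_name = "aubmindlab/aragpt2-base"
tokenizer = GPT2TokenizerFast.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT2 tokenizers have no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# One document per line; remember to preprocess the text with ArabertPreprocessor first.
dataset = load_dataset("text", data_files={"train": "train.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="aragpt2-finetuned",
                           per_device_train_batch_size=2,
                           num_train_epochs=1),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```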
Fine-tuning using our code with TF 1.15.4
Create the Training TFRecords
```bash
python create_pretraining_data.py \
 --input_file=<RAW TEXT FILE with documents/article separated by an empty line> \
 --output_file=<OUTPUT TFRecord> \
 --tokenizer_dir=<Directory with the GPT2 Tokenizer files>
```
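The raw text file passed to --input_file should contain one document or article per block, with an empty line between documents. A hedged sketch of preparing such a file, using the same arabert preprocessing shown above (file names and the documents list are placeholders):

```python
# Hedged sketch: writing the raw text file expected by create_pretraining_data.py,
# with one document per block and an empty line between documents.
# File names, the model name, and the corpus source are placeholders.
from arabert.preprocess import ArabertPreprocessor

arabert_prep = ArabertPreprocessor(model_name="aubmindlab/aragpt2-base")

documents = ["...first article...", "...second article..."]  # your raw corpus

with open("raw_corpus.txt", "w", encoding="utf-8") as f:
    for doc in documents:
        f.write(arabert_prep.preprocess(doc))
        f.write("\n\n")  # the empty line separates documents/articles
```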
Fine-tuning
```bash
python3 run_pretraining.py \
 --input_file="gs://<GS_BUCKET>/pretraining_data/*" \
 --output_dir="gs://<GS_BUCKET>/pretraining_model/" \
 --config_file="config/small_hparams.json" \
 --batch_size=128 \
 --eval_batch_size=8 \
 --num_train_steps= \
 --num_warmup_steps= \
 --learning_rate= \
 --save_checkpoints_steps= \
 --max_seq_length=1024 \
 --max_eval_steps= \
 --optimizer="lamb" \
 --iterations_per_loop=5000 \
 --keep_checkpoint_max=10 \
 --use_tpu=True \
 --tpu_name=<TPU NAME> \
 --do_train=True \
 --do_eval=False
```
Documentation
Model Sizes
| Model | Optimizer | Context size | Embedding Size | Num of heads | Num of layers | Model Size / Num of Params |
|:---|:---|:---|:---|:---|:---|:---|
| AraGPT2-base | lamb | 1024 | 768 | 12 | 12 | 527MB / 135M |
| AraGPT2-medium | lamb | 1024 | 1024 | 16 | 24 | 1.38GB / 370M |
| AraGPT2-large | adafactor | 1024 | 1280 | 20 | 36 | 2.98GB / 792M |
| AraGPT2-mega | adafactor | 1024 | 1536 | 25 | 48 | 5.5GB / 1.46B |
All models are available on the HuggingFace model page under the aubmindlab name. Checkpoints are available in PyTorch, TF2 and TF1 formats.
Compute
| Model | Hardware | Num of examples (seq len = 1024) | Batch Size | Num of Steps | Time (in days) |
|:---|:---|:---|:---|:---|:---|
| AraGPT2-base | TPUv3-128 | 9.7M | 1792 | 125K | 1.5 |
| AraGPT2-medium | TPUv3-8 | 9.7M | 1152 | 85K | 1.5 |
| AraGPT2-large | TPUv3-128 | 9.7M | 256 | 220K | 3 |
| AraGPT2-mega | TPUv3-128 | 9.7M | 256 | 780K | 9 |
Dataset
The pretraining data used for the new AraBERT model is also used for AraGPT2 and AraELECTRA.
The dataset consists of 77GB, or 200,095,961 lines, or 8,655,948,860 words, or 82,232,988,358 characters (before applying Farasa segmentation).
For the new dataset, we added the unshuffled OSCAR corpus (after thoroughly filtering it) to the dataset used in AraBERTv1, but without the websites that we previously crawled:
- OSCAR unshuffled and filtered
- [Arabic Wikipedia dump](https://archive.org/details/arwiki-20190201) from 2020/09/01
- [The 1.5B words Arabic Corpus](https://www.semanticscholar.org/paper/1.5-billion-words-Arabic-Corpus-El-Khair/f3eeef4afb81223df96575adadf808fe7fe440b4)
- [The OSIAN Corpus](https://www.aclweb.org/anthology/W19-4619)
- Assafir news articles. Huge thank you to Assafir for giving us the data
⚠️ Important Note
The model expects the input to be preprocessed using the arabert library; otherwise, it will not be able to generate correct output.
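For example, a minimal sketch of that preprocessing step (the model name and the raw string are only illustrations):

```python
# Minimal sketch of the required preprocessing step before generation.
from arabert.preprocess import ArabertPreprocessor

arabert_prep = ArabertPreprocessor(model_name="aubmindlab/aragpt2-mega")
text_clean = arabert_prep.preprocess("النص العربي الخام هنا")  # "the raw Arabic text goes here"
```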
Citation
If you use this model, please cite us as:
```bibtex
@inproceedings{antoun-etal-2021-aragpt2,
    title = "{A}ra{GPT}2: Pre-Trained Transformer for {A}rabic Language Generation",
    author = "Antoun, Wissam  and
      Baly, Fady  and
      Hajj, Hazem",
    booktitle = "Proceedings of the Sixth Arabic Natural Language Processing Workshop",
    month = apr,
    year = "2021",
    address = "Kyiv, Ukraine (Virtual)",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.wanlp-1.21",
    pages = "196--207",
}
```
Acknowledgments
Thanks to TensorFlow Research Cloud (TFRC) for the free access to Cloud TPUs; we couldn't have done it without this program. Thanks also to the AUB MIND Lab members for their continuous support, and to Yakshof and Assafir for data and storage access. Another thank you goes to Habib Rahal (https://www.behance.net/rahalhabib) for putting a face to AraBERT.
Contacts