AraT5-base-title-generation: Open-Source Arabic News Title Generation Model

AraT5 Base Title Generation

Developed by UBC-NLP

AraT5 is a family of Arabic text generation models with three variants: one for Modern Standard Arabic, one for Twitter dialectal data, and a general-purpose version.

Large Language Model

Transformers

#Arabic #Arabic news headline generation #Multi-dialect Arabic processing #T5 architecture optimization


Release Time: 3/2/2022

Model Overview

A Transformer-based text generation model specialized for Arabic, supporting various tasks such as news headline generation, text summarization, and machine translation

Model Features

Multi-domain adaptation

Provides dedicated versions for Modern Standard Arabic and for Twitter dialectal data, plus a general-purpose version

Multi-task support

Supports various text generation tasks such as headline generation, text summarization, machine translation, and paraphrasing

Dialect processing capability

Specially optimized for handling Arabic dialects (e.g., Twitter data)

Model Capabilities

News headline generation

Text summarization

Machine translation

Text paraphrasing

Code-switching translation

Question generation

Use Cases

News media

Automatic Arabic news headline generation

Automatically generates multiple candidate headlines based on news article content

The usage example samples five candidate headlines for a single article (num_return_sequences=5)

Social media

Twitter content summarization

Automatic summarization of Arabic Twitter content

🚀 AraT5: Text-to-Text Transformers for Arabic Language Generation

This repository presents AraT5, a set of powerful Arabic-specific text-to-text Transformer-based models. It includes AraT5_MSA, AraT5_Tweet, and AraT5. These models are designed for various Arabic language generation tasks, such as news title generation, text summarization, and more.

🚀 Quick Start

This is the repository accompanying our paper AraT5: Text-to-Text Transformers for Arabic Language Understanding and Generation.

✨ Features

Multifunctional: Suitable for a wide range of Arabic language tasks including news title generation, text summarization, and code-switched translation.
Specifically Designed: Tailored for the Arabic language, considering its unique linguistic features and dialects.

📦 Installation

The AraT5 PyTorch and TensorFlow checkpoints are available on the Hugging Face Hub for direct download, for research use only. For commercial use, please contact the authors by email (muhammad.mageed[at]ubc[dot]ca).

Property	Details
Model Type	AraT5-base, AraT5-msa-base, AraT5-tweet-base, AraT5-msa-small, AraT5-tweet-small
Download Link	Huggingface Repository
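
The variant names in the table can be turned into Hub repository IDs with a small helper. This is only a sketch: it assumes every checkpoint follows the same `UBC-NLP/<name>` pattern as the title-generation model used in the usage example, which should be checked on the Hub before relying on it. The `hub_id` helper and the `ARAT5_VARIANTS` tuple are hypothetical names introduced here for illustration.

```python
# Variant names exactly as listed in the "Model Type" row of the table.
ARAT5_VARIANTS = (
    "AraT5-base",
    "AraT5-msa-base",
    "AraT5-tweet-base",
    "AraT5-msa-small",
    "AraT5-tweet-small",
)


def hub_id(variant, org="UBC-NLP"):
    """Build a Hub repository ID for a listed AraT5 variant.

    Assumption: repositories are named "<org>/<variant>", mirroring
    "UBC-NLP/AraT5-base-title-generation"; this is not confirmed here.
    """
    if variant not in ARAT5_VARIANTS:
        raise ValueError(f"Unknown AraT5 variant: {variant!r}")
    return f"{org}/{variant}"


if __name__ == "__main__":
    for name in ARAT5_VARIANTS:
        print(hub_id(name))
```

The resulting string could then be passed to `AutoTokenizer.from_pretrained` and `AutoModelForSeq2SeqLM.from_pretrained`, as in the usage example.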

💻 Usage Examples

Basic Usage

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and the fine-tuned title-generation checkpoint.
tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/AraT5-base-title-generation")
model = AutoModelForSeq2SeqLM.from_pretrained("UBC-NLP/AraT5-base-title-generation")

Document = "تحت رعاية صاحب السمو الملكي الأمير سعود بن نايف بن عبدالعزيز أمير المنطقة الشرقية اختتمت غرفة الشرقية مؤخرا، الثاني من مبادرتها لتأهيل وتدريب أبناء وبنات المملكة ضمن مبادرتها المجانية للعام 2019 حيث قدمت 6 برامج تدريبية نوعية. وثمن رئيس مجلس إدارة الغرفة، عبدالحكيم العمار الخالدي، رعاية سمو أمير المنطقة الشرقية للمبادرة، مؤكدا أن دعم سموه لجميع أنشطة ."

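# Tokenize the article; a single input needs no padding
# (the pad_to_max_length argument is deprecated in recent transformers releases).
encoding = tokenizer(Document, return_tensors="pt")
input_ids, attention_masks = encoding["input_ids"], encoding["attention_mask"]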


# Sample five candidate titles using top-k and top-p (nucleus) sampling.
outputs = model.generate(
    input_ids=input_ids, attention_mask=attention_masks,
    max_length=256,
    do_sample=True,
    top_k=120,
    top_p=0.95,
    early_stopping=True,
    num_return_sequences=5
)

# Decode each sampled sequence back into text and print it.
for idx, output in enumerate(outputs):
    title = tokenizer.decode(output, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    print("title#" + str(idx), title)

Advanced Usage

The Basic Usage example runs inference with AraT5-base fine-tuned for news title generation on the Aranews dataset; it does not perform fine-tuning itself. You can adjust the generation parameters to your needs: max_length caps the length of each generated title, while top_k and top_p control how many candidate tokens are considered at each sampling step, trading diversity against precision.
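
To make the effect of top_k and top_p more concrete, here is a minimal pure-Python sketch of the two filters applied to a toy next-token distribution. It is an illustration only, not the transformers implementation: the function name `top_k_top_p_filter` and the toy probabilities are made up for this example, and the order (top-k first, then top-p on the renormalized survivors) is an assumption about how the two filters are commonly combined.

```python
def top_k_top_p_filter(probs, top_k=0, top_p=1.0):
    """Return {token_index: probability} for the tokens that survive
    top-k filtering followed by top-p (nucleus) filtering.

    probs: list of next-token probabilities (a toy distribution here).
    top_k: keep only the k most probable tokens (0 disables this filter).
    top_p: keep the smallest set of top tokens whose cumulative
           probability reaches top_p (1.0 disables this filter).
    """
    # Rank token indices from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Top-k: drop everything beyond the k most probable tokens.
    if top_k > 0:
        ranked = ranked[:top_k]

    # Top-p: walk down the ranking, accumulating renormalized probability,
    # and stop as soon as the running total reaches top_p.
    total = sum(probs[i] for i in ranked)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / total
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1 before sampling.
    kept_total = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_total for i in kept}


if __name__ == "__main__":
    toy_probs = [0.5, 0.3, 0.15, 0.05]  # hypothetical 4-token vocabulary
    print(top_k_top_p_filter(toy_probs, top_k=3, top_p=0.8))
    print(top_k_top_p_filter(toy_probs, top_k=0, top_p=1.0))
```

Lower values of top_k or top_p shrink the set of tokens that can be sampled, which tends to make the generated titles more conservative and more similar to each other; higher values allow more varied candidates.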

📚 Documentation

If you use our models (AraT5-base, AraT5-msa-base, AraT5-tweet-base, AraT5-msa-small, or AraT5-tweet-small) for your scientific publication, or if you find the resources in this repository useful, please cite our paper as follows:

@inproceedings{nagoudi-etal-2022-arat5,
    title = "{A}ra{T}5: Text-to-Text Transformers for {A}rabic Language Generation",
    author = "Nagoudi, El Moatez Billah  and
      Elmadany, AbdelRahim  and
      Abdul-Mageed, Muhammad",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.47",
    pages = "628--647",
    abstract = "Transfer learning with a unified Transformer framework (T5) that converts all language problems into a text-to-text format was recently proposed as a simple and effective transfer learning approach. Although a multilingual version of the T5 model (mT5) was also introduced, it is not clear how well it can fare on non-English tasks involving diverse data. To investigate this question, we apply mT5 on a language with a wide variety of dialects{--}Arabic. For evaluation, we introduce a novel benchmark for ARabic language GENeration (ARGEN), covering seven important tasks. For model comparison, we pre-train three powerful Arabic T5-style models and evaluate them on ARGEN. Although pre-trained with {\textasciitilde}49{\%} less data, our new models perform significantly better than mT5 on all ARGEN tasks (in 52 out of 59 test sets) and set several new SOTAs. Our models also establish new SOTA on the recently-proposed, large Arabic language understanding evaluation benchmark ARLUE (Abdul-Mageed et al., 2021). Our new models are publicly available. We also link to ARGEN datasets through our repository: https://github.com/UBC-NLP/araT5.",
}

📄 License

The models are available for research use only. For commercial use, please contact the authors by email (muhammad.mageed[at]ubc[dot]ca).

Acknowledgments

We gratefully acknowledge support from the Natural Sciences and Engineering Research Council of Canada, the Social Sciences and Humanities Research Council of Canada, Canadian Foundation for Innovation, ComputeCanada, and UBC ARC-Sockeye. We also thank the Google TensorFlow Research Cloud (TFRC) program for providing us with free TPU access.
