Arabic T5 Small
This is a T5v1.1 (small) model trained on the combined dataset of the Arabic Billion Words corpus and the Arabic subsets of mC4 and Oscar. It offers a foundation for various Arabic language processing tasks.
Quick Start
This is a T5v1.1 (small) model trained on the concatenation of the Arabic Billion Words corpus and the Arabic subsets of the mC4 and Oscar datasets. Due to time constraints, the model was trained on approximately 10% of the entire dataset, which is equivalent to 22,000 steps or about 4.3 billion tokens.
Features
- Trained on a large-scale Arabic dataset that combines multiple high-quality corpora.
- Keeps Arabic diacritics in the vocabulary, unlike other pre-trained Arabic language models (see the sketch after this list).
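Because diacritics are part of the vocabulary, they should survive an encode/decode round trip. A minimal sketch; the sample word is chosen purely for illustration:

```python
from transformers import AutoTokenizer

# Load the tokenizer shipped with the model.
tokenizer = AutoTokenizer.from_pretrained("flax-community/arabic-t5-small")

# A fully diacritized word, used only as an example.
text = "كَتَبَ"
ids = tokenizer(text).input_ids
decoded = tokenizer.decode(ids, skip_special_tokens=True)

# The diacritics should be preserved in the decoded output.
print(decoded)
```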
Usage Examples
Basic Usage
from transformers import T5ForConditionalGeneration
model = T5ForConditionalGeneration.from_pretrained("flax-community/arabic-t5-small", dropout_rate=0.1)
Advanced Usage
from transformers import AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained("flax-community/arabic-t5-small", dropout_rate=0.1)
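Since this checkpoint is only pretrained (no downstream finetuning), raw generation is mainly useful as a sanity check. A rough sketch of running it end to end; the input sentence, the sentinel-token usage, and the generation settings are illustrative only:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("flax-community/arabic-t5-small")
# If the repository only ships Flax weights, add from_flax=True (requires flax to be installed).
model = AutoModelForSeq2SeqLM.from_pretrained("flax-community/arabic-t5-small")

# Span-corruption style input: the sentinel token <extra_id_0> marks a masked span.
text = "عاصمة مصر هي <extra_id_0> ."
inputs = tokenizer(text, return_tensors="pt")

# Greedy decoding with a small budget; tune these settings for real use.
outputs = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```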
Documentation
Training parameters
| Property | Details |
|----------|---------|
| Training batch size | 384 |
| Evaluation batch size | 768 |
| Learning rate | 1e-2 |
| Dtype | jnp.float32 |
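The optimizer itself is not listed above; Adafactor is the usual choice for T5 pretraining, so the following sketch treats it as an assumption and only wires in the documented hyperparameters:

```python
import jax.numpy as jnp
import optax

TRAIN_BATCH_SIZE = 384   # per the table above
EVAL_BATCH_SIZE = 768
LEARNING_RATE = 1e-2
DTYPE = jnp.float32

# Adafactor is assumed here; the table only fixes the learning rate and dtype.
optimizer = optax.adafactor(learning_rate=LEARNING_RATE)
```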
Preprocessing and the tokenizer
We attempted to minimize preprocessing: we only replaced URLs, emails, and social media user mentions with fixed tokens. Unlike other pre-trained Arabic language models, we decided not to strip Arabic diacritics and kept them as part of the vocabulary.
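A minimal sketch of that replacement step; the regular expressions and placeholder tokens below are illustrative, not the exact ones used for training:

```python
import re

# Illustrative placeholder tokens; the actual fixed tokens may differ.
URL_TOKEN, EMAIL_TOKEN, USER_TOKEN = "[URL]", "[EMAIL]", "[USER]"

def preprocess(text: str) -> str:
    # Replace URLs, then email addresses, then @-mentions with fixed tokens.
    text = re.sub(r"https?://\S+|www\.\S+", URL_TOKEN, text)
    text = re.sub(r"\S+@\S+\.\S+", EMAIL_TOKEN, text)
    text = re.sub(r"@\w+", USER_TOKEN, text)
    return text
```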
The tokenizer was trained on 5% of the training set, with a vocabulary size of 64,000. For more details about preprocessing, check the tokenizer code.
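T5 models use a SentencePiece vocabulary, so training the tokenizer on such a sample might look roughly like the sketch below; the file name and training options are assumptions, and the referenced tokenizer code is the authoritative version:

```python
import sentencepiece as spm

# "train_sample_5pct.txt" is a hypothetical dump of the 5% sample of the training set.
spm.SentencePieceTrainer.train(
    input="train_sample_5pct.txt",
    model_prefix="arabic_t5_small_sp",
    vocab_size=64_000,
    model_type="unigram",  # T5 tokenizers are unigram SentencePiece models
)
```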
Data
The model was trained on the concatenation of the Arabic Billion Words corpus and the Arabic subsets of the mC4 and Oscar datasets. A random 0.1% subset of the data was reserved for evaluation, and the rest was used for training.
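A minimal sketch of such a split, assuming the Hugging Face datasets library and using the Arabic OSCAR subset as a stand-in for the full combined corpus:

```python
from datasets import load_dataset

# One of the three sources; the actual setup also concatenates Arabic Billion Words and mC4.
dataset = load_dataset("oscar", "unshuffled_deduplicated_ar", split="train")

# Hold out a random 0.1% of the examples for evaluation.
split = dataset.train_test_split(test_size=0.001, seed=42)
train_ds, eval_ds = split["train"], split["test"]
```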
Results
| Property | Details |
|----------|---------|
| Evaluation accuracy | 56.84% |
| Evaluation loss | 2.423 |
| Training loss | 2.392 |
| Training time | 22h 23m 51s |
Note for finetuning
This model was pretrained with dropout turned off, so the default dropout_rate in the model config is 0. To finetune the model, dropout should be turned back on, as shown in the code examples above.
Technical Details
The model uses the T5v1.1 (small) architecture and was pretrained on the combined Arabic corpora described above, using the preprocessing, tokenizer settings, and training parameters listed in the Documentation section.