Arabic T5 Small
This is a T5v1.1 (small) model trained on the combined dataset of the Arabic Billion Words corpus and the Arabic subsets of mC4 and Oscar. It offers a foundation for various Arabic language processing tasks.
Quick Start
This is a T5v1.1 (small) model trained on the concatenation of the Arabic Billion Words corpus and the Arabic subsets of the mC4 and Oscar datasets. Due to time constraints, the model was trained on approximately 10% of the entire dataset, which is equivalent to 22,000 steps or about 4.3 billion tokens.
Features
- Trained on a large-scale Arabic dataset that combines multiple high-quality corpora.
- Keeps Arabic diacritics in the vocabulary, unlike other pre-trained Arabic language models (see the sketch after this list).
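Because diacritics are part of the vocabulary, they should survive an encode/decode round trip. A minimal sketch; the sample word is chosen purely for illustration:

```python
from transformers import AutoTokenizer

# Load the tokenizer shipped with the model.
tokenizer = AutoTokenizer.from_pretrained("flax-community/arabic-t5-small")

# A fully diacritized word, used only as an example.
text = "كَتَبَ"
ids = tokenizer(text).input_ids
decoded = tokenizer.decode(ids, skip_special_tokens=True)

# The diacritics should be preserved in the decoded output.
print(decoded)
```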
Usage Examples
Basic Usage
from transformers import T5ForConditionalGeneration
model = T5ForConditionalGeneration.from_pretrained("flax-community/arabic-t5-small", dropout_rate=0.1)
Advanced Usage
from transformers import AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained("flax-community/arabic-t5-small", dropout_rate=0.1)
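Since this checkpoint is only pretrained (no downstream finetuning), raw generation is mainly useful as a sanity check. A rough sketch of running it end to end; the input sentence, the sentinel-token usage, and the generation settings are illustrative only:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("flax-community/arabic-t5-small")
# If the repository only ships Flax weights, add from_flax=True (requires flax to be installed).
model = AutoModelForSeq2SeqLM.from_pretrained("flax-community/arabic-t5-small")

# Span-corruption style input: the sentinel token <extra_id_0> marks a masked span.
text = "عاصمة مصر هي <extra_id_0> ."
inputs = tokenizer(text, return_tensors="pt")

# Greedy decoding with a small budget; tune these settings for real use.
outputs = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```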
Documentation
Training parameters
| Property | Details |
|----------|---------|
| Training batch size | 384 |
| Evaluation batch size | 768 |
| Learning rate | 1e-2 |
| Dtype | jnp.float32 |
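The optimizer itself is not listed above; Adafactor is the usual choice for T5 pretraining, so the following sketch treats it as an assumption and only wires in the documented hyperparameters:

```python
import jax.numpy as jnp
import optax

TRAIN_BATCH_SIZE = 384   # per the table above
EVAL_BATCH_SIZE = 768
LEARNING_RATE = 1e-2
DTYPE = jnp.float32

# Adafactor is assumed here; the table only fixes the learning rate and dtype.
optimizer = optax.adafactor(learning_rate=LEARNING_RATE)
```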
Preprocessing and the tokenizer
We attempted to minimize preprocessing: we only replaced URLs, emails, and social media user mentions with fixed tokens. Unlike other pre-trained Arabic language models, we decided not to strip Arabic diacritics and kept them as part of the vocabulary.
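A minimal sketch of that replacement step; the regular expressions and placeholder tokens below are illustrative, not the exact ones used for training:

```python
import re

# Illustrative placeholder tokens; the actual fixed tokens may differ.
URL_TOKEN, EMAIL_TOKEN, USER_TOKEN = "[URL]", "[EMAIL]", "[USER]"

def preprocess(text: str) -> str:
    # Replace URLs, then email addresses, then @-mentions with fixed tokens.
    text = re.sub(r"https?://\S+|www\.\S+", URL_TOKEN, text)
    text = re.sub(r"\S+@\S+\.\S+", EMAIL_TOKEN, text)
    text = re.sub(r"@\w+", USER_TOKEN, text)
    return text
```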
The tokenizer was trained on 5% of the training set, with a vocabulary size of 64,000. For more details about preprocessing, check the tokenizer code.
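T5 models use a SentencePiece vocabulary, so training the tokenizer on such a sample might look roughly like the sketch below; the file name and training options are assumptions, and the referenced tokenizer code is the authoritative version:

```python
import sentencepiece as spm

# "train_sample_5pct.txt" is a hypothetical dump of the 5% sample of the training set.
spm.SentencePieceTrainer.train(
    input="train_sample_5pct.txt",
    model_prefix="arabic_t5_small_sp",
    vocab_size=64_000,
    model_type="unigram",  # T5 tokenizers are unigram SentencePiece models
)
```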
Data
The model was trained on the concatenation of the Arabic Billion Words corpus and the Arabic subsets of the mC4 and Oscar datasets. A random 0.1% subset of the data was reserved for evaluation, and the rest was used for training.
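A minimal sketch of such a split, assuming the Hugging Face datasets library and using the Arabic OSCAR subset as a stand-in for the full combined corpus:

```python
from datasets import load_dataset

# One of the three sources; the actual setup also concatenates Arabic Billion Words and mC4.
dataset = load_dataset("oscar", "unshuffled_deduplicated_ar", split="train")

# Hold out a random 0.1% of the examples for evaluation.
split = dataset.train_test_split(test_size=0.001, seed=42)
train_ds, eval_ds = split["train"], split["test"]
```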
Results
| Property | Details |
|----------|---------|
| Evaluation accuracy | 56.84% |
| Evaluation loss | 2.423 |
| Training loss | 2.392 |
| Training time | 22h 23m 51s |
Note for finetuning
This model was pretrained with dropout turned off, so the default dropout_rate in the model config is 0. To finetune the model, dropout should be turned back on, as shown in the code examples above.
Technical Details
The model uses the T5v1.1 (small) architecture and was pretrained on the combined Arabic corpora described above, using the preprocessing, tokenizer settings, and training parameters listed in the Documentation section.