Arabic BERT Mini Model
A pre-trained BERT Mini language model for Arabic, intended to support natural language processing tasks in Arabic.
If you use this model in your work, please cite this paper:
@inproceedings{safaya-etal-2020-kuisail,
title = "{KUISAIL} at {S}em{E}val-2020 Task 12: {BERT}-{CNN} for Offensive Speech Identification in Social Media",
author = "Safaya, Ali and
Abdullatif, Moutasem and
Yuret, Deniz",
booktitle = "Proceedings of the Fourteenth Workshop on Semantic Evaluation",
month = dec,
year = "2020",
address = "Barcelona (online)",
publisher = "International Committee for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.semeval-1.271",
pages = "2054--2059",
}
Documentation
Pretraining Corpus
The arabic-bert-mini model was pre-trained on approximately 8.2 billion words, sourced from:
- the Arabic version of OSCAR, filtered from Common Crawl
- a recent dump of Arabic Wikipedia
It also incorporates other Arabic resources, totaling around 95GB of text.
Notes on training data:
- The final version of the corpus contains some non-Arabic words within sentences. These were not removed, because doing so could hurt tasks such as Named Entity Recognition (NER).
- Non-Arabic characters were lowercased as a preprocessing step. Since Arabic script has no case distinction, there is no separate cased and uncased version of the model (see the tokenizer sketch below).
- The corpus and vocabulary set are not limited to Modern Standard Arabic; they also include some dialectal Arabic.
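As a quick illustration of the lowercasing note above, the snippet below tokenizes a mixed Arabic/Latin string. The exact subword splits depend on the released vocabulary, so treat the printed output as indicative rather than guaranteed:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("asafaya/bert-mini-arabic")

# Latin characters are expected to come out lowercased ("bert", ...),
# while Arabic characters are unaffected since the script has no case.
print(tokenizer.tokenize("نموذج BERT للغة العربية"))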
Pretraining Details
- This model was trained using Google BERT's GitHub repository on a single TPU v3-8, provided free of charge by TFRC.
- The pre-training procedure follows the training settings of BERT with some modifications: it was trained for 3 million training steps with a batch size of 128, instead of 1 million steps with a batch size of 256.
Usage Examples
Basic Usage
You can use this model by installing torch or tensorflow and the Huggingface transformers library. Initialize it as follows:
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the tokenizer and the masked-language-modeling head
tokenizer = AutoTokenizer.from_pretrained("asafaya/bert-mini-arabic")
model = AutoModelForMaskedLM.from_pretrained("asafaya/bert-mini-arabic")
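Once loaded, the model can be exercised end to end. The short sketch below uses the transformers fill-mask pipeline to predict a masked word; the example sentence is only illustrative, and the predictions depend on the model itself:

from transformers import pipeline

# Build a fill-mask pipeline on top of the same checkpoint
fill_mask = pipeline("fill-mask", model="asafaya/bert-mini-arabic")

# Predict the masked token; BERT models use [MASK] as the mask token
for prediction in fill_mask("اللغة العربية [MASK]."):
    print(prediction["token_str"], prediction["score"])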
Installation
To use the model, you need to install the following libraries:
- torch or tensorflow
- the Huggingface transformers library
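For a typical PyTorch setup, an install along these lines should be enough (package names are the standard PyPI ones):

pip install torch transformers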
License
No license information provided in the original document.
Technical Details
Model Performance
For further details on the model's performance or any other queries, please refer to Arabic-BERT.
Acknowledgement
Thanks to Google for providing a free TPU for the training process and to Huggingface for hosting this model on their servers.