🚀 AraBERTv0.2-Twitter
AraBERTv0.2-Twitter-base/large are two new models for Arabic dialects and tweets, trained by continuing the pre-training using the MLM task on ~60M Arabic tweets (filtered from a collection of 100M).
📦 Datasets
- wikipedia
- Osian
- 1.5B-Arabic-Corpus
- oscar-arabic-unshuffled
- Assafir(private)
- Twitter(private)
🧩 Widget
- text: " عاصمة لبنان هي [MASK] ."

✨ Features
AraBERTv0.2-Twitter-base/large are two new models designed for Arabic dialects and tweets. They are trained by continuing the pre-training using the MLM task on approximately 60 million Arabic tweets (filtered from a collection of 100 million). Both models add emojis to their vocabulary, along with common words that were not originally present. The pre-training was carried out with a maximum sentence length of 64 for only one epoch.
AraBERT is an Arabic pretrained language model based on Google's BERT architecture. It uses the same BERT-Base config. More details are available in the AraBERT Paper and in the AraBERT Meetup.
📋 Other Models
| Model | HuggingFace Model Name | Size (MB/Params) | Pre-Segmentation | DataSet (Sentences/Size/nWords) |
|---|---|---|---|---|
| AraBERTv0.2-base | [bert-base-arabertv02](https://huggingface.co/aubmindlab/bert-base-arabertv02) | 543MB / 136M | No | 200M / 77GB / 8.6B |
| AraBERTv0.2-large | [bert-large-arabertv02](https://huggingface.co/aubmindlab/bert-large-arabertv02) | 1.38G / 371M | No | 200M / 77GB / 8.6B |
| AraBERTv2-base | [bert-base-arabertv2](https://huggingface.co/aubmindlab/bert-base-arabertv2) | 543MB / 136M | Yes | 200M / 77GB / 8.6B |
| AraBERTv2-large | [bert-large-arabertv2](https://huggingface.co/aubmindlab/bert-large-arabertv2) | 1.38G / 371M | Yes | 200M / 77GB / 8.6B |
| AraBERTv0.1-base | [bert-base-arabertv01](https://huggingface.co/aubmindlab/bert-base-arabertv01) | 543MB / 136M | No | 77M / 23GB / 2.7B |
| AraBERTv1-base | [bert-base-arabert](https://huggingface.co/aubmindlab/bert-base-arabert) | 543MB / 136M | Yes | 77M / 23GB / 2.7B |
| AraBERTv0.2-Twitter-base | [bert-base-arabertv02-twitter](https://huggingface.co/aubmindlab/bert-base-arabertv02-twitter) | 543MB / 136M | No | Same as v02 + 60M Multi-Dialect Tweets |
| AraBERTv0.2-Twitter-large | [bert-large-arabertv02-twitter](https://huggingface.co/aubmindlab/bert-large-arabertv02-twitter) | 1.38G / 371M | No | Same as v02 + 60M Multi-Dialect Tweets |
💻 Usage Examples
Basic Usage
from arabert.preprocess import ArabertPreprocessor
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "aubmindlab/bert-base-arabertv02-twitter"
arabert_prep = ArabertPreprocessor(model_name=model_name)

text = "ولن نبالغ إذا قلنا إن هاتف أو كمبيوتر المكتب في زمننا هذا ضروري"
text_preprocessed = arabert_prep.preprocess(text)  # keep the result and feed it to the tokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
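As a rough illustration of what the loaded masked-LM can be used for, the sketch below ranks candidates for a `[MASK]` token with the `transformers` fill-mask pipeline. The helper names `format_predictions` and `predict_mask` are hypothetical and not part of AraBERT. The input text is assumed to be already preprocessed, and calling `predict_mask` needs network access to download the model weights.

```python
MODEL_NAME = "aubmindlab/bert-base-arabertv02-twitter"


def format_predictions(results):
    """Turn fill-mask pipeline output into (token, rounded score) pairs."""
    return [(r["token_str"], round(r["score"], 4)) for r in results]


def predict_mask(text, top_k=5, model_name=MODEL_NAME):
    """Rank candidate tokens for a single [MASK] in `text` (hypothetical helper)."""
    # Imported lazily: transformers is a heavy dependency and the call
    # below downloads the model on first use.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=model_name)
    return format_predictions(fill_mask(text, top_k=top_k))
```

For example, `predict_mask(" عاصمة لبنان هي [MASK] .")` would return up to five `(token, score)` pairs for the widget sentence. The actual tokens and scores depend on the downloaded checkpoint.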
📚 Documentation
Preprocessing
⚠️ Important Note
The model is trained on a sequence length of 64. Using a max length beyond 64 might result in degraded performance.
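For plain truncation, Hugging Face tokenizers accept `truncation=True, max_length=64`. When the whole text matters, one possible workaround (an assumption on our side, not something the model authors prescribe) is to split the token IDs into windows that each fit within 64 positions. The helper `split_into_windows` below is hypothetical. In practice, `cls_id` and `sep_id` would come from `tokenizer.cls_token_id` and `tokenizer.sep_token_id`.

```python
from typing import List

MAX_LEN = 64  # sequence length the model was trained on


def split_into_windows(token_ids: List[int], cls_id: int, sep_id: int,
                       max_len: int = MAX_LEN) -> List[List[int]]:
    """Split token IDs (without special tokens) into [CLS] ... [SEP] windows
    of at most `max_len` positions each."""
    body = max_len - 2  # room left once [CLS] and [SEP] are added
    if body <= 0:
        raise ValueError("max_len must be at least 3")
    windows = []
    for start in range(0, len(token_ids), body):
        windows.append([cls_id] + token_ids[start:start + body] + [sep_id])
    return windows
```

Each window can then be scored separately and the per-window outputs combined, for example by averaging. Whether that beats simple truncation is task-dependent.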
It is recommended to apply our preprocessing function before training/testing on any dataset. The preprocessor will keep and space out emojis when used with a "twitter" model.
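To give a feel for what "spacing out" emojis means, here is a small standalone approximation. It is not the `ArabertPreprocessor` implementation. The function name `space_out_emojis` and the Unicode ranges are assumptions chosen for illustration, and they do not cover every emoji (for instance flags or ZWJ sequences).

```python
import re

# Illustrative subset of emoji code points; the real preprocessor may differ.
_EMOJI = re.compile("([\U0001F300-\U0001FAFF\u2600-\u27BF])")


def space_out_emojis(text: str) -> str:
    """Surround each matched emoji with spaces, then collapse repeated whitespace."""
    spaced = _EMOJI.sub(r" \1 ", text)
    return re.sub(r"\s+", " ", spaced).strip()
```

Separating emojis this way lets each one map to its own vocabulary entry instead of being glued to a neighbouring word.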
📄 Citation
If you use this model, please cite us as follows.
Google Scholar has our BibTeX wrong (missing name); use this instead:
@inproceedings{antoun2020arabert,
title={AraBERT: Transformer-based Model for Arabic Language Understanding},
author={Antoun, Wissam and Baly, Fady and Hajj, Hazem},
booktitle={LREC 2020 Workshop Language Resources and Evaluation Conference 11--16 May 2020},
pages={9}
}
🙏 Acknowledgments
Thanks to TensorFlow Research Cloud (TFRC) for the free access to Cloud TPUs. We couldn't have done it without this program. Thanks also to the AUB MIND Lab Members for their continuous support. Additionally, thanks to Yakshof and Assafir for data and storage access. Another thanks to Habib Rahal (https://www.behance.net/rahalhabib) for putting a face to AraBERT.
📞 Contacts
Wissam Antoun: LinkedIn | Twitter | GitHub | wfa07@mail.aub.edu | wissam.antoun@gmail.com
Fady Baly: LinkedIn | Twitter | GitHub | fgb06@mail.aub.edu | baly.fady@gmail.com