🚀 AraBERTv0.2-Twitter
AraBERTv0.2-Twitter is designed to handle Arabic dialects and tweets effectively. It continues the pre-training of AraBERTv0.2 with the MLM task on approximately 60M Arabic tweets (filtered from a collection of 100M). The vocabulary of the two new Twitter models includes emojis and common words that were not initially present. Pre-training was conducted with a maximum sentence length of 64 for only one epoch.
AraBERT is an Arabic pre-trained language model based on Google's BERT architecture. It uses the same BERT-Base configuration. More details can be found in the AraBERT Paper and the AraBERT Meetup.
📦 Datasets
- wikipedia
- Osian
- 1.5B-Arabic-Corpus
- oscar-arabic-unshuffled
- Assafir (private)
- Twitter (private)
🎛️ Widget
- text: " عاصمة لبنان هي [MASK] ."
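The widget example above can be reproduced locally with the `transformers` fill-mask pipeline. The snippet below is a minimal sketch of typical usage; the pipeline call itself is an assumption, not part of this card:

```python
from transformers import pipeline

# Load the Twitter model into a fill-mask pipeline
fill_mask = pipeline("fill-mask", model="aubmindlab/bert-base-arabertv02-twitter")

# Ask the model to complete the widget example ("The capital of Lebanon is [MASK].")
for prediction in fill_mask("عاصمة لبنان هي [MASK] ."):
    print(prediction["token_str"], round(prediction["score"], 3))
```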
✨ Features
- Vocabulary Expansion: The models have emojis and additional common words in their vocabulary.
- Dialect and Tweet Focus: Specifically trained for Arabic dialects and tweets.
📋 Other Models
| Model | HuggingFace Model Name | Size (MB/Params) | Pre-Segmentation | DataSet (Sentences/Size/nWords) |
|---|---|---|---|---|
| AraBERTv0.2-base | [bert-base-arabertv02](https://huggingface.co/aubmindlab/bert-base-arabertv02) | 543MB / 136M | No | 200M / 77GB / 8.6B |
| AraBERTv0.2-large | [bert-large-arabertv02](https://huggingface.co/aubmindlab/bert-large-arabertv02) | 1.38G / 371M | No | 200M / 77GB / 8.6B |
| AraBERTv2-base | [bert-base-arabertv2](https://huggingface.co/aubmindlab/bert-base-arabertv2) | 543MB / 136M | Yes | 200M / 77GB / 8.6B |
| AraBERTv2-large | [bert-large-arabertv2](https://huggingface.co/aubmindlab/bert-large-arabertv2) | 1.38G / 371M | Yes | 200M / 77GB / 8.6B |
| AraBERTv0.1-base | [bert-base-arabertv01](https://huggingface.co/aubmindlab/bert-base-arabertv01) | 543MB / 136M | No | 77M / 23GB / 2.7B |
| AraBERTv1-base | [bert-base-arabert](https://huggingface.co/aubmindlab/bert-base-arabert) | 543MB / 136M | Yes | 77M / 23GB / 2.7B |
| AraBERTv0.2-Twitter-base | [bert-base-arabertv02-twitter](https://huggingface.co/aubmindlab/bert-base-arabertv02-twitter) | 543MB / 136M | No | Same as v02 + 60M Multi-Dialect Tweets |
| AraBERTv0.2-Twitter-large | [bert-large-arabertv02-twitter](https://huggingface.co/aubmindlab/bert-large-arabertv02-twitter) | 1.38G / 371M | No | Same as v02 + 60M Multi-Dialect Tweets |
💻 Usage Examples
Basic Usage
```python
from arabert.preprocess import ArabertPreprocessor
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "aubmindlab/bert-base-arabertv02-twitter"

# Apply the AraBERT preprocessor before feeding text to the model
arabert_prep = ArabertPreprocessor(model_name=model_name)

text = "ولن نبالغ إذا قلنا إن هاتف أو كمبيوتر المكتب في زمننا هذا ضروري"
text_preprocessed = arabert_prep.preprocess(text)

# Load the tokenizer and masked-LM model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
```
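Continuing from the snippet above, the following sketch runs the preprocessed text through the loaded model; the forward pass and shape check are illustrative assumptions, not part of the original example:

```python
import torch

# Tokenize the preprocessed text and run a forward pass through the masked-LM head
inputs = tokenizer(text_preprocessed, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Logits over the vocabulary for every token position
print(outputs.logits.shape)  # (batch_size, sequence_length, vocab_size)
```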
📚 Documentation
Preprocessing
⚠️ Important Note
The model was trained with a maximum sequence length of 64, so using a longer max length may result in degraded performance.
It is recommended to apply the preprocessing function before training or testing on any dataset. When used with a "twitter" model, the preprocessor keeps emojis and spaces them out.
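As a hedged illustration of the two points above, the sketch below applies the preprocessor before tokenization and caps sequences at the training length of 64; the sample tweet list is hypothetical:

```python
from arabert.preprocess import ArabertPreprocessor
from transformers import AutoTokenizer

model_name = "aubmindlab/bert-base-arabertv02-twitter"
arabert_prep = ArabertPreprocessor(model_name=model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Hypothetical raw tweets (illustrative only); the preprocessor keeps and spaces out emojis
raw_tweets = ["ولن نبالغ إذا قلنا إن هاتف أو كمبيوتر المكتب في زمننا هذا ضروري 😊"]
cleaned = [arabert_prep.preprocess(t) for t in raw_tweets]

# Tokenize with the sequence length the model was trained on
encodings = tokenizer(cleaned, truncation=True, max_length=64, padding="max_length", return_tensors="pt")
print(encodings["input_ids"].shape)  # (num_examples, 64)
```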
📄 Citation
If you use this model, please cite us as follows:
Google Scholar has our BibTeX entry wrong (a name is missing), so please use this one instead:
```bibtex
@inproceedings{antoun2020arabert,
  title={AraBERT: Transformer-based Model for Arabic Language Understanding},
  author={Antoun, Wissam and Baly, Fady and Hajj, Hazem},
  booktitle={LREC 2020 Workshop Language Resources and Evaluation Conference 11--16 May 2020},
  pages={9}
}
```
🙏 Acknowledgments
Thanks to TensorFlow Research Cloud (TFRC) for free access to Cloud TPUs; this project could not have been done without this program. Thanks to the AUB MIND Lab members for their continuous support. Thanks also to Yakshof and Assafir for data and storage access, and to Habib Rahal (https://www.behance.net/rahalhabib) for putting a face to AraBERT.
📞 Contacts