🚀 QARiB: QCRI Arabic and Dialectal BERT
QARiB is the QCRI Arabic and Dialectal BERT model. It addresses the need for high-quality Arabic language processing by leveraging large-scale Arabic data, and it can be used for NLP tasks such as sentiment analysis and emotion detection.
✨ Features
- Rich Training Data: trained on a collection of ~420 million tweets and ~180 million sentences of text.
- High Performance: outperforms multilingual BERT/AraBERT/ArabicBERT on multiple downstream NLP tasks.
📦 Installation
QARiB is used through the Hugging Face transformers library; install it with pip install transformers before running the examples below.
💻 Usage Examples
Basic Usage
You can use this model directly with a pipeline for masked language modeling:
>>> from transformers import pipeline
>>> fill_mask = pipeline("fill-mask", model="./models/data60gb_86k")

>>> fill_mask("شو عندكم يا [MASK]")
[{'sequence': '[CLS] شو عندكم يا عرب [SEP]', 'score': 0.0990147516131401, 'token': 2355, 'token_str': 'عرب'},
 {'sequence': '[CLS] شو عندكم يا جماعة [SEP]', 'score': 0.051633741706609726, 'token': 2308, 'token_str': 'جماعة'},
 {'sequence': '[CLS] شو عندكم يا شباب [SEP]', 'score': 0.046871256083250046, 'token': 939, 'token_str': 'شباب'},
 {'sequence': '[CLS] شو عندكم يا رفاق [SEP]', 'score': 0.03598872944712639, 'token': 7664, 'token_str': 'رفاق'},
 {'sequence': '[CLS] شو عندكم يا ناس [SEP]', 'score': 0.031996358186006546, 'token': 271, 'token_str': 'ناس'}]

>>> fill_mask("قللي وشفيييك يرحم [MASK]")
[{'sequence': '[CLS] قللي وشفيييك يرحم والديك [SEP]', 'score': 0.4152909517288208, 'token': 9650, 'token_str': 'والديك'},
 {'sequence': '[CLS] قللي وشفيييك يرحملي [SEP]', 'score': 0.07663793861865997, 'token': 294, 'token_str': '##لي'},
 {'sequence': '[CLS] قللي وشفيييك يرحم حالك [SEP]', 'score': 0.0453166700899601, 'token': 2663, 'token_str': 'حالك'},
 {'sequence': '[CLS] قللي وشفيييك يرحمونك [SEP]', 'score': 0.027349254116415977, 'token': 3283, 'token_str': '##ونك'}]

>>> fill_mask("وقام المدير [MASK]")
[{'sequence': '[CLS] وقام المدير بالعمل [SEP]', 'score': 0.0678194984793663, 'token': 4230, 'token_str': 'بالعمل'},
 {'sequence': '[CLS] وقام المدير بذلك [SEP]', 'score': 0.05191086605191231, 'token': 984, 'token_str': 'بذلك'},
 {'sequence': '[CLS] وقام المدير بالاتصال [SEP]', 'score': 0.045264165848493576, 'token': 26096, 'token_str': 'بالاتصال'},
 {'sequence': '[CLS] وقام المدير بعمله [SEP]', 'score': 0.03732728958129883, 'token': 40486, 'token_str': 'بعمله'},
 {'sequence': '[CLS] وقام المدير بالامر [SEP]', 'score': 0.0246378555893898, 'token': 29124, 'token_str': 'بالامر'}]

>>> fill_mask("وقامت المديرة [MASK]")
[{'sequence': '[CLS] وقامت المديرة بذلك [SEP]', 'score': 0.23992691934108734, 'token': 984, 'token_str': 'بذلك'},
 {'sequence': '[CLS] وقامت المديرة بالامر [SEP]', 'score': 0.108805812895298, 'token': 29124, 'token_str': 'بالامر'},
 {'sequence': '[CLS] وقامت المديرة بالعمل [SEP]', 'score': 0.06639821827411652, 'token': 4230, 'token_str': 'بالعمل'},
 {'sequence': '[CLS] وقامت المديرة بالاتصال [SEP]', 'score': 0.05613093823194504, 'token': 26096, 'token_str': 'بالاتصال'},
 {'sequence': '[CLS] وقامت المديرة المديرة [SEP]', 'score': 0.021778125315904617, 'token': 41635, 'token_str': 'المديرة'}]
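The example above points at a local checkpoint directory. As a minimal sketch, assuming you want to load the published checkpoint straight from the Hugging Face Hub instead (the model ID qarib/bert-base-qarib60_1970k is the one linked in the download section below; other QARiB variants work the same way):

```python
from transformers import pipeline

# Load QARiB directly from the Hugging Face Hub rather than a local directory.
# "qarib/bert-base-qarib60_1970k" is the checkpoint linked later in this card.
fill_mask = pipeline("fill-mask", model="qarib/bert-base-qarib60_1970k")

# Print the top predictions for the masked token in a dialectal Arabic sentence.
for prediction in fill_mask("شو عندكم يا [MASK]"):
    print(prediction["token_str"], round(prediction["score"], 4))
```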
📚 Documentation
About QARiB
The QCRI Arabic and Dialectal BERT (QARiB) model was trained on a collection of ~420 million tweets and ~180 million sentences of text. The tweets were collected using the Twitter API with the language filter lang:ar. The text data was a combination of Arabic GigaWord, the Abulkhair Arabic Corpus, and OPUS.
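For illustration only, the snippet below shows how a lang:ar filter looks in a Twitter API v2 search request issued through the third-party tweepy client; the bearer token and query term are placeholders, and this is not the authors' actual collection pipeline.

```python
import tweepy

# Placeholder credentials; substitute a real bearer token.
client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")

# The lang:ar operator restricts results to tweets identified as Arabic.
response = client.search_recent_tweets(query="الدوحة lang:ar", max_results=10)
for tweet in response.data or []:
    print(tweet.text)
```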
bert-base-qarib60_1970k

| Property | Details |
|----------|---------|
| Data size | 60GB |
| Number of iterations | 1970k |
| Loss | 1.5708898 |
Training QARiB
The training of the model was performed using Google's original TensorFlow code on Google Cloud TPU v2. A Google Cloud Storage bucket was used for the persistent storage of training data and models. See more details in Training QARiB.
Using QARiB
You can use the raw model for either masked language modeling or next sentence prediction, but it is mostly intended to be fine-tuned on a downstream task. See the model hub to look for fine-tuned versions on a task that interests you. For more details, see Using QARiB.
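As a rough sketch of that fine-tuning step (the dataset, label count, and hyperparameters below are illustrative placeholders, not the settings used by the QARiB authors), a binary sentiment classifier could be trained like this:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_id = "qarib/bert-base-qarib60_1970k"  # QARiB checkpoint on the Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Illustrative CSV files with "text" and "label" columns; use your own Arabic corpus.
dataset = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="qarib-sentiment",
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
trainer.train()
```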
Training procedure
The training of the model was performed using Google's original TensorFlow code on an eight-core Google Cloud TPU v2. A Google Cloud Storage bucket was used for the persistent storage of training data and models.
Eval results
We evaluated QARiB models on five NLP downstream tasks:
- Sentiment Analysis
- Emotion Detection
- Named-Entity Recognition (NER)
- Offensive Language Detection
- Dialect Identification
On these tasks, QARiB models outperform multilingual BERT/AraBERT/ArabicBERT.
Model Weights and Vocab Download
You can download the model weights and vocab from the Hugging Face Hub: https://huggingface.co/qarib/bert-base-qarib60_1970k
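If you would rather fetch the files programmatically than through the website, here is a small sketch using the huggingface_hub client (the repository ID matches the link above):

```python
from huggingface_hub import snapshot_download

# Download the model weights, config, and vocab files into the local cache.
local_dir = snapshot_download(repo_id="qarib/bert-base-qarib60_1970k")
print("Model files stored at:", local_dir)
```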
Contacts
Ahmed Abdelali, Sabit Hassan, Hamdy Mubarak, Kareem Darwish and Younes Samih
Reference
@article{abdelali2021pretraining,
title={Pre-Training BERT on Arabic Tweets: Practical Considerations},
author={Ahmed Abdelali and Sabit Hassan and Hamdy Mubarak and Kareem Darwish and Younes Samih},
year={2021},
eprint={2102.10684},
archivePrefix={arXiv},
primaryClass={cs.CL}
}