🚀 QARiB: QCRI Arabic and Dialectal BERT
QARiB is the QCRI Arabic and Dialectal BERT model. It addresses the need for high-quality Arabic language processing by leveraging large-scale Arabic data, and it can be used for NLP tasks such as sentiment analysis and emotion detection.
✨ Features
- Rich Training Data: trained on a collection of ~420 million tweets and ~180 million sentences of text.
- High Performance: outperforms multilingual BERT/AraBERT/ArabicBERT on multiple downstream NLP tasks.
📦 Installation
QARiB is used through the Hugging Face transformers library; install it with pip install transformers before running the examples below.
💻 Usage Examples
Basic Usage
You can use this model directly with a pipeline for masked language modeling:
>>> from transformers import pipeline
>>> fill_mask = pipeline("fill-mask", model="./models/data60gb_86k")

>>> fill_mask("شو عندكم يا [MASK]")
[{'sequence': '[CLS] شو عندكم يا عرب [SEP]', 'score': 0.0990147516131401, 'token': 2355, 'token_str': 'عرب'},
 {'sequence': '[CLS] شو عندكم يا جماعة [SEP]', 'score': 0.051633741706609726, 'token': 2308, 'token_str': 'جماعة'},
 {'sequence': '[CLS] شو عندكم يا شباب [SEP]', 'score': 0.046871256083250046, 'token': 939, 'token_str': 'شباب'},
 {'sequence': '[CLS] شو عندكم يا رفاق [SEP]', 'score': 0.03598872944712639, 'token': 7664, 'token_str': 'رفاق'},
 {'sequence': '[CLS] شو عندكم يا ناس [SEP]', 'score': 0.031996358186006546, 'token': 271, 'token_str': 'ناس'}]

>>> fill_mask("قللي وشفيييك يرحم [MASK]")
[{'sequence': '[CLS] قللي وشفيييك يرحم والديك [SEP]', 'score': 0.4152909517288208, 'token': 9650, 'token_str': 'والديك'},
 {'sequence': '[CLS] قللي وشفيييك يرحملي [SEP]', 'score': 0.07663793861865997, 'token': 294, 'token_str': '##لي'},
 {'sequence': '[CLS] قللي وشفيييك يرحم حالك [SEP]', 'score': 0.0453166700899601, 'token': 2663, 'token_str': 'حالك'},
 {'sequence': '[CLS] قللي وشفيييك يرحمونك [SEP]', 'score': 0.027349254116415977, 'token': 3283, 'token_str': '##ونك'}]

>>> fill_mask("وقام المدير [MASK]")
[{'sequence': '[CLS] وقام المدير بالعمل [SEP]', 'score': 0.0678194984793663, 'token': 4230, 'token_str': 'بالعمل'},
 {'sequence': '[CLS] وقام المدير بذلك [SEP]', 'score': 0.05191086605191231, 'token': 984, 'token_str': 'بذلك'},
 {'sequence': '[CLS] وقام المدير بالاتصال [SEP]', 'score': 0.045264165848493576, 'token': 26096, 'token_str': 'بالاتصال'},
 {'sequence': '[CLS] وقام المدير بعمله [SEP]', 'score': 0.03732728958129883, 'token': 40486, 'token_str': 'بعمله'},
 {'sequence': '[CLS] وقام المدير بالامر [SEP]', 'score': 0.0246378555893898, 'token': 29124, 'token_str': 'بالامر'}]

>>> fill_mask("وقامت المديرة [MASK]")
[{'sequence': '[CLS] وقامت المديرة بذلك [SEP]', 'score': 0.23992691934108734, 'token': 984, 'token_str': 'بذلك'},
 {'sequence': '[CLS] وقامت المديرة بالامر [SEP]', 'score': 0.108805812895298, 'token': 29124, 'token_str': 'بالامر'},
 {'sequence': '[CLS] وقامت المديرة بالعمل [SEP]', 'score': 0.06639821827411652, 'token': 4230, 'token_str': 'بالعمل'},
 {'sequence': '[CLS] وقامت المديرة بالاتصال [SEP]', 'score': 0.05613093823194504, 'token': 26096, 'token_str': 'بالاتصال'},
 {'sequence': '[CLS] وقامت المديرة المديرة [SEP]', 'score': 0.021778125315904617, 'token': 41635, 'token_str': 'المديرة'}]
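The example above points at a local checkpoint directory. As a minimal sketch, assuming you want to load the published checkpoint straight from the Hugging Face Hub instead (the model ID qarib/bert-base-qarib60_1970k is the one linked in the download section below; other QARiB variants work the same way):

```python
from transformers import pipeline

# Load QARiB directly from the Hugging Face Hub rather than a local directory.
# "qarib/bert-base-qarib60_1970k" is the checkpoint linked later in this card.
fill_mask = pipeline("fill-mask", model="qarib/bert-base-qarib60_1970k")

# Print the top predictions for the masked token in a dialectal Arabic sentence.
for prediction in fill_mask("شو عندكم يا [MASK]"):
    print(prediction["token_str"], round(prediction["score"], 4))
```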
📚 Documentation
About QARiB
The QCRI Arabic and Dialectal BERT (QARiB) model was trained on a collection of ~420 million tweets and ~180 million sentences of text. The tweets were collected using the Twitter API with the language filter lang:ar. The text data was a combination of Arabic GigaWord, the Abulkhair Arabic Corpus, and OPUS.
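For illustration only, the snippet below shows how a lang:ar filter looks in a Twitter API v2 search request issued through the third-party tweepy client; the bearer token and query term are placeholders, and this is not the authors' actual collection pipeline.

```python
import tweepy

# Placeholder credentials; substitute a real bearer token.
client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")

# The lang:ar operator restricts results to tweets identified as Arabic.
response = client.search_recent_tweets(query="الدوحة lang:ar", max_results=10)
for tweet in response.data or []:
    print(tweet.text)
```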
bert-base-qarib60_1970k

| Property | Details |
|----------|---------|
| Data size | 60GB |
| Number of iterations | 1970k |
| Loss | 1.5708898 |
Training QARiB
The training of the model was performed using Google's original TensorFlow code on Google Cloud TPU v2. A Google Cloud Storage bucket was used for the persistent storage of training data and models. See more details in Training QARiB.
Using QARiB
You can use the raw model for either masked language modeling or next sentence prediction, but it is mostly intended to be fine-tuned on a downstream task. See the model hub to look for fine-tuned versions on a task that interests you. For more details, see Using QARiB.
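As a rough sketch of that fine-tuning step (the dataset, label count, and hyperparameters below are illustrative placeholders, not the settings used by the QARiB authors), a binary sentiment classifier could be trained like this:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_id = "qarib/bert-base-qarib60_1970k"  # QARiB checkpoint on the Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Illustrative CSV files with "text" and "label" columns; use your own Arabic corpus.
dataset = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="qarib-sentiment",
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
trainer.train()
```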
Training procedure
The training of the model was performed using Google's original TensorFlow code on an eight-core Google Cloud TPU v2. A Google Cloud Storage bucket was used for the persistent storage of training data and models.
Eval results
We evaluated QARiB models on five NLP downstream tasks:
- Sentiment Analysis
- Emotion Detection
- Named-Entity Recognition (NER)
- Offensive Language Detection
- Dialect Identification
On these tasks, QARiB models outperform multilingual BERT/AraBERT/ArabicBERT.
Model Weights and Vocab Download
You can download the model weights and vocab from the Hugging Face Hub: https://huggingface.co/qarib/bert-base-qarib60_1970k
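If you would rather fetch the files programmatically than through the website, here is a small sketch using the huggingface_hub client (the repository ID matches the link above):

```python
from huggingface_hub import snapshot_download

# Download the model weights, config, and vocab files into the local cache.
local_dir = snapshot_download(repo_id="qarib/bert-base-qarib60_1970k")
print("Model files stored at:", local_dir)
```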
Contacts
Ahmed Abdelali, Sabit Hassan, Hamdy Mubarak, Kareem Darwish and Younes Samih
Reference
@article{abdelali2021pretraining,
title={Pre-Training BERT on Arabic Tweets: Practical Considerations},
author={Ahmed Abdelali and Sabit Hassan and Hamdy Mubarak and Kareem Darwish and Younes Samih},
year={2021},
eprint={2102.10684},
archivePrefix={arXiv},
primaryClass={cs.CL}
}