🚀 QARiB: QCRI Arabic and Dialectal BERT
QARiB is a pre-trained BERT model specifically designed for Arabic and dialectal Arabic. It is trained on a large-scale dataset of tweets and text, enabling it to handle a variety of natural language processing tasks effectively.
🚀 Quick Start
You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to be fine-tuned on a downstream task. See the model hub to look for fine-tuned versions on a task that interests you. For more details, see Using QARiB.
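As a minimal sketch of loading the checkpoint directly (assuming the `qarib/bert-base-qarib` hub identifier given in the download section below), you can pull the tokenizer and masked-LM head with the `transformers` Auto classes:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the tokenizer and masked-LM head from the Hugging Face hub
tokenizer = AutoTokenizer.from_pretrained("qarib/bert-base-qarib")
model = AutoModelForMaskedLM.from_pretrained("qarib/bert-base-qarib")

# Score candidates for the [MASK] position
text = "شو عندكم يا [MASK]"  # "What's up, [MASK]?"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] token and print its top-5 predictions
mask_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top5 = logits[0, mask_index].topk(5).indices[0]
print(tokenizer.convert_ids_to_tokens(top5.tolist()))
```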
✨ Features
- Trained on a vast collection of ~420 million tweets and ~180 million sentences of text.
- Demonstrates excellent performance in multiple NLP tasks such as dialect identification, emotion detection, named-entity recognition, offensive language detection, and sentiment analysis.
📦 Installation
QARiB is distributed through the Hugging Face `transformers` library, which you can install with `pip install transformers`.
💻 Usage Examples
Basic Usage
You can use this model directly with a pipeline for masked language modeling:
>>> from transformers import pipeline
>>> fill_mask = pipeline("fill-mask", model="qarib/bert-base-qarib")
>>> fill_mask("شو عندكم يا [MASK]")
[{'sequence': '[CLS] شو عندكم يا عرب [SEP]', 'score': 0.0990147516131401, 'token': 2355, 'token_str': 'عرب'},
{'sequence': '[CLS] شو عندكم يا جماعة [SEP]', 'score': 0.051633741706609726, 'token': 2308, 'token_str': 'جماعة'},
{'sequence': '[CLS] شو عندكم يا شباب [SEP]', 'score': 0.046871256083250046, 'token': 939, 'token_str': 'شباب'},
{'sequence': '[CLS] شو عندكم يا رفاق [SEP]', 'score': 0.03598872944712639, 'token': 7664, 'token_str': 'رفاق'},
{'sequence': '[CLS] شو عندكم يا ناس [SEP]', 'score': 0.031996358186006546, 'token': 271, 'token_str': 'ناس'}
]
>>> fill_mask("وقام المدير [MASK]")
[
{'sequence': '[CLS] وقام المدير بالعمل [SEP]', 'score': 0.0678194984793663, 'token': 4230, 'token_str': 'بالعمل'},
{'sequence': '[CLS] وقام المدير بذلك [SEP]', 'score': 0.05191086605191231, 'token': 984, 'token_str': 'بذلك'},
{'sequence': '[CLS] وقام المدير بالاتصال [SEP]', 'score': 0.045264165848493576, 'token': 26096, 'token_str': 'بالاتصال'},
{'sequence': '[CLS] وقام المدير بعمله [SEP]', 'score': 0.03732728958129883, 'token': 40486, 'token_str': 'بعمله'},
{'sequence': '[CLS] وقام المدير بالامر [SEP]', 'score': 0.0246378555893898, 'token': 29124, 'token_str': 'بالامر'}
]
>>> fill_mask("وقامت المديرة [MASK]")
[{'sequence': '[CLS] وقامت المديرة بذلك [SEP]', 'score': 0.23992691934108734, 'token': 984, 'token_str': 'بذلك'},
{'sequence': '[CLS] وقامت المديرة بالامر [SEP]', 'score': 0.108805812895298, 'token': 29124, 'token_str': 'بالامر'},
{'sequence': '[CLS] وقامت المديرة بالعمل [SEP]', 'score': 0.06639821827411652, 'token': 4230, 'token_str': 'بالعمل'},
{'sequence': '[CLS] وقامت المديرة بالاتصال [SEP]', 'score': 0.05613093823194504, 'token': 26096, 'token_str': 'بالاتصال'},
{'sequence': '[CLS] وقامت المديرة المديرة [SEP]', 'score': 0.021778125315904617, 'token': 41635, 'token_str': 'المديرة'}]
>>> fill_mask("قللي وشفيييك يرحم [MASK]")
[{'sequence': '[CLS] قللي وشفيييك يرحم والديك [SEP]', 'score': 0.4152909517288208, 'token': 9650, 'token_str': 'والديك'},
{'sequence': '[CLS] قللي وشفيييك يرحملي [SEP]', 'score': 0.07663793861865997, 'token': 294, 'token_str': '##لي'},
{'sequence': '[CLS] قللي وشفيييك يرحم حالك [SEP]', 'score': 0.0453166700899601, 'token': 2663, 'token_str': 'حالك'},
{'sequence': '[CLS] قللي وشفيييك يرحم امك [SEP]', 'score': 0.04390475153923035, 'token': 1942, 'token_str': 'امك'},
{'sequence': '[CLS] قللي وشفيييك يرحمونك [SEP]', 'score': 0.027349254116415977, 'token': 3283, 'token_str': '##ونك'}]
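Since the model is mostly intended for fine-tuning, here is a minimal, hypothetical sketch of attaching a classification head for a downstream task such as sentiment analysis. The label set, example texts, and hyperparameters below are placeholders for illustration, not from the original evaluation setup:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "qarib/bert-base-qarib"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# num_labels=2 is a placeholder for a binary sentiment task
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
model.train()

# Toy batch; real fine-tuning would iterate over a labelled dataset
texts = ["الفيلم رائع", "الخدمة سيئة"]  # "The movie is great", "The service is bad"
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, return_tensors="pt")

# One gradient step as an illustration of the training loop body
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs = model(**batch, labels=labels)
outputs.loss.backward()
optimizer.step()
```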
📚 Documentation
About QARiB
The QCRI Arabic and Dialectal BERT (QARiB) model was trained on a collection of approximately 420 million tweets and about 180 million sentences of text. The tweets were collected through the Twitter API using the language filter lang:ar. The text data combined Arabic GigaWord, the Abulkhair Arabic Corpus, and OPUS.
QARiB (قارب) is the Arabic word for "boat".
Model and Parameters
| Property | Details |
|----------|---------|
| Data size | 14B tokens |
| Vocabulary | 64k |
| Iterations | 10M |
| Number of Layers | 12 |
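These figures can be cross-checked against the published checkpoint. A small sketch (again assuming the `qarib/bert-base-qarib` identifier) reads them from the model config:

```python
from transformers import AutoConfig

# Fetch only the configuration, not the weights
config = AutoConfig.from_pretrained("qarib/bert-base-qarib")
print(config.num_hidden_layers)  # 12 layers, per the table above
print(config.vocab_size)         # 64k vocabulary, per the table above
```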
Training QARiB
See details in Training QARiB
Evaluations
| Experiment | mBERT | AraBERT0.1 | AraBERT1.0 | ArabicBERT | QARiB |
|------------|-------|------------|------------|------------|-------|
| Dialect Identification | 6.06% | 59.92% | 59.85% | 61.70% | 65.21% |
| Emotion Detection | 27.90% | 43.89% | 42.37% | 41.65% | 44.35% |
| Named-Entity Recognition (NER) | 49.38% | 64.97% | 66.63% | 64.04% | 61.62% |
| Offensive Language Detection | 83.14% | 88.07% | 88.97% | 88.19% | 91.94% |
| Sentiment Analysis | 86.61% | 90.80% | 93.58% | 83.27% | 93.31% |
Model Weights and Vocab Download
You can download the model weights and vocabulary from the Hugging Face Hub: https://huggingface.co/qarib/bert-base-qarib
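For an offline copy of the weights and vocabulary, one option is the `huggingface_hub` client, sketched below; note that `from_pretrained` also downloads and caches these files automatically:

```python
from huggingface_hub import snapshot_download

# Downloads all repository files (weights, vocab, config) to the local cache
local_dir = snapshot_download(repo_id="qarib/bert-base-qarib")
print(local_dir)
```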
Contacts
Ahmed Abdelali, Sabit Hassan, Hamdy Mubarak, Kareem Darwish and Younes Samih
Reference
@article{abdelali2021pretraining,
title={Pre-Training BERT on Arabic Tweets: Practical Considerations},
author={Ahmed Abdelali and Sabit Hassan and Hamdy Mubarak and Kareem Darwish and Younes Samih},
year={2021},
eprint={2102.10684},
archivePrefix={arXiv},
primaryClass={cs.CL}
}