bert-base-qarib60_1790k Open-source Arabic BERT Model - Supports Multiple Types of Practical NLP Tasks

Bert Base Qarib60 1790k

Developed by ahmedabdelali

QARiB is an Arabic and dialect BERT model trained on approximately 420 million tweets and 180 million text sentences, supporting various downstream NLP tasks.

Tags: Large Language Model, Arabic, Arabic Dialect Processing, Social Media Text Analysis, Multitask NLP Optimization

Downloads 16

Release Time : 3/2/2022

Model Overview

This model is pre-trained on Arabic and its dialects. It can be used directly for masked language modeling or fine-tuned for downstream NLP tasks, and its authors report that it outperforms multilingual BERT, AraBERT, and ArabicBERT on five such tasks.

Model Features

Large-scale Arabic Training Data

Trained on 420 million tweets and 180 million text sentences, covering Modern Standard Arabic and dialects.

Multi-domain Data Integration

Integrates Twitter data, Arabic GigaWord, the Abulkhair Arabic Corpus, and the OPUS multilingual corpus.

Dialect Support

Specially optimized for processing Arabic dialects.

High Performance

Outperforms multilingual BERT/AraBERT/ArabicBERT in five downstream NLP tasks.

Model Capabilities

Arabic Text Understanding

Dialect Identification

Sentiment Analysis

Named Entity Recognition

Offensive Language Detection

Use Cases

Social Media Analysis

Arabic Tweet Sentiment Analysis

Analyze sentiment tendencies in Arabic tweets.

Reported to outperform multilingual BERT, AraBERT, and ArabicBERT on this task.

Dialect Identification

Identify Arabic dialects in text.

Reported to outperform multilingual BERT, AraBERT, and ArabicBERT on this task.

Text Processing

Named Entity Recognition

Identify entities such as person names and locations in Arabic text.

Offensive Language Detection

Detect offensive content in Arabic text.

🚀 QARiB: QCRI Arabic and Dialectal BERT

QARiB is a BERT model trained on Arabic tweets and text, offering strong performance on multiple NLP downstream tasks.

🚀 Quick Start

You can use the raw model for either masked language modeling or next sentence prediction, but it is mostly intended to be fine-tuned on a downstream task. See the model hub for fine-tuned versions on a task that interests you. For more details, see Using QARiB.
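As a rough illustration of the fine-tuning route described above, the sketch below prepares the checkpoint for a three-class sentiment task with the Hugging Face transformers library. The hub ID `qarib/bert-base-qarib60_1790k`, the label set, and the helper name `load_for_classification` are assumptions made for this example, not details taken from the original model card.

```python
# Hypothetical label set for a sentiment task; replace it with your own labels.
LABELS = ["negative", "neutral", "positive"]
ID2LABEL = dict(enumerate(LABELS))
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}

# Assumed hub ID; substitute the ID or local path of your copy of the checkpoint.
MODEL_ID = "qarib/bert-base-qarib60_1790k"


def load_for_classification(model_id=MODEL_ID):
    """Load the tokenizer and the encoder with a new classification head."""
    # Imported lazily so the label mappings above work without transformers.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_id,
        num_labels=len(LABELS),
        id2label=ID2LABEL,
        label2id=LABEL2ID,
    )
    return tokenizer, model
```

The classification head created this way is randomly initialized, so the model still needs fine-tuning on labeled data before its predictions are meaningful.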

✨ Features

Rich training data: Trained on a collection of ~420 million tweets and ~180 million sentences of text.
High performance: Outperforms multilingual BERT/AraBERT/ArabicBERT on five NLP downstream tasks.

📦 Installation

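The original model card lists no installation steps. The usage examples below import the Hugging Face transformers library, so it must be available in your Python environment.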

💻 Usage Examples

Basic Usage

You can use this model directly with a pipeline for masked language modeling:

>>> from transformers import pipeline
>>> fill_mask = pipeline("fill-mask", model="./models/data60gb_86k")

>>> fill_mask("شو عندكم يا [MASK]")
[{'sequence': '[CLS] شو عندكم يا عرب [SEP]', 'score': 0.0990147516131401, 'token': 2355, 'token_str': 'عرب'}, 
{'sequence': '[CLS] شو عندكم يا جماعة [SEP]', 'score': 0.051633741706609726, 'token': 2308, 'token_str': 'جماعة'}, 
{'sequence': '[CLS] شو عندكم يا شباب [SEP]', 'score': 0.046871256083250046, 'token': 939, 'token_str': 'شباب'}, 
{'sequence': '[CLS] شو عندكم يا رفاق [SEP]', 'score': 0.03598872944712639, 'token': 7664, 'token_str': 'رفاق'}, 
{'sequence': '[CLS] شو عندكم يا ناس [SEP]', 'score': 0.031996358186006546, 'token': 271, 'token_str': 'ناس'}]

>>> fill_mask("قللي وشفيييك يرحم [MASK]")
[{'sequence': '[CLS] قللي وشفيييك يرحم والديك [SEP]', 'score': 0.4152909517288208, 'token': 9650, 'token_str': 'والديك'}, 
{'sequence': '[CLS] قللي وشفيييك يرحملي [SEP]', 'score': 0.07663793861865997, 'token': 294, 'token_str': '##لي'}, 
{'sequence': '[CLS] قللي وشفيييك يرحم حالك [SEP]', 'score': 0.0453166700899601, 'token': 2663, 'token_str': 'حالك'}, 
{'sequence': '[CLS] قللي وشفيييك يرحم امك [SEP]', 'score': 0.04390475153923035, 'token': 1942, 'token_str': 'امك'}, 
{'sequence': '[CLS] قللي وشفيييك يرحمونك [SEP]', 'score': 0.027349254116415977, 'token': 3283, 'token_str': '##ونك'}]

>>> fill_mask("وقام المدير [MASK]")
[
{'sequence': '[CLS] وقام المدير بالعمل [SEP]', 'score': 0.0678194984793663, 'token': 4230, 'token_str': 'بالعمل'}, 
{'sequence': '[CLS] وقام المدير بذلك [SEP]', 'score': 0.05191086605191231, 'token': 984, 'token_str': 'بذلك'}, 
{'sequence': '[CLS] وقام المدير بالاتصال [SEP]', 'score': 0.045264165848493576, 'token': 26096, 'token_str': 'بالاتصال'}, 
{'sequence': '[CLS] وقام المدير بعمله [SEP]', 'score': 0.03732728958129883, 'token': 40486, 'token_str': 'بعمله'}, 
{'sequence': '[CLS] وقام المدير بالامر [SEP]', 'score': 0.0246378555893898, 'token': 29124, 'token_str': 'بالامر'}
]
>>> fill_mask("وقامت المديرة [MASK]")
[{'sequence': '[CLS] وقامت المديرة بذلك [SEP]', 'score': 0.23992691934108734, 'token': 984, 'token_str': 'بذلك'}, 
{'sequence': '[CLS] وقامت المديرة بالامر [SEP]', 'score': 0.108805812895298, 'token': 29124, 'token_str': 'بالامر'}, 
{'sequence': '[CLS] وقامت المديرة بالعمل [SEP]', 'score': 0.06639821827411652, 'token': 4230, 'token_str': 'بالعمل'}, 
{'sequence': '[CLS] وقامت المديرة بالاتصال [SEP]', 'score': 0.05613093823194504, 'token': 26096, 'token_str': 'بالاتصال'}, 
{'sequence': '[CLS] وقامت المديرة المديرة [SEP]', 'score': 0.021778125315904617, 'token': 41635, 'token_str': 'المديرة'}]
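
As the outputs above show, each call returns a list of dictionaries, one per candidate, with `sequence`, `score`, `token`, and `token_str` keys. A minimal sketch of post-processing such a list follows. The helper name `top_candidates` and its `min_score` and `limit` parameters are illustrative assumptions, and the sample data is copied from the first three candidates of the first example.

```python
def top_candidates(predictions, min_score=0.0, limit=None):
    """Return (token_str, score) pairs sorted by descending score."""
    kept = [
        (p["token_str"], p["score"])
        for p in predictions
        if p["score"] >= min_score
    ]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept if limit is None else kept[:limit]


# First three candidates from the first fill_mask example above.
sample = [
    {"sequence": "[CLS] شو عندكم يا عرب [SEP]", "score": 0.0990147516131401,
     "token": 2355, "token_str": "عرب"},
    {"sequence": "[CLS] شو عندكم يا جماعة [SEP]", "score": 0.051633741706609726,
     "token": 2308, "token_str": "جماعة"},
    {"sequence": "[CLS] شو عندكم يا شباب [SEP]", "score": 0.046871256083250046,
     "token": 939, "token_str": "شباب"},
]

# Keep only candidates scoring at least 0.05 and print them one per line.
for token_str, score in top_candidates(sample, min_score=0.05):
    print(f"{token_str}\t{score:.3f}")
```

Filtering on the score is a simple way to discard low-probability fillers before passing candidates to later processing; a suitable threshold depends on the task.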

📚 Documentation

About QARiB

The QCRI Arabic and Dialectal BERT (QARiB) model was trained on a collection of ~420 million tweets and ~180 million sentences of text. The tweets were collected using the Twitter API with the language filter lang:ar. The text data combines Arabic GigaWord, the Abulkhair Arabic Corpus, and OPUS.

bert-base-qarib60_1790k

Data size: 60 GB
Number of Iterations: 1790k
Loss: 1.8764963

Training QARiB

The training of the model has been performed using Google’s original TensorFlow code on Google Cloud TPU v2. We used a Google Cloud Storage bucket for the persistent storage of training data and models. See more details in Training QARiB

Training procedure

The training of the model has been performed using Google’s original TensorFlow code on an eight-core Google Cloud TPU v2. We used a Google Cloud Storage bucket for the persistent storage of training data and models.

Eval results

We evaluated QARiB models on five NLP downstream tasks:

Sentiment Analysis
Emotion Detection
Named-Entity Recognition (NER)
Offensive Language Detection
Dialect Identification

The results obtained from QARiB models outperform multilingual BERT/AraBERT/ArabicBERT.

Model Weights and Vocab Download

From the Hugging Face site: https://huggingface.co/qarib/qarib/bert-base-qarib60_1790k

Contacts

Ahmed Abdelali, Sabit Hassan, Hamdy Mubarak, Kareem Darwish and Younes Samih

Reference

@article{abdelali2021pretraining,
    title={Pre-Training BERT on Arabic Tweets: Practical Considerations},
    author={Ahmed Abdelali and Sabit Hassan and Hamdy Mubarak and Kareem Darwish and Younes Samih},
    year={2021},
    eprint={2102.10684},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
