bert-base-arabertv02-twitter: Open-Source Model Optimized for Arabic Dialects and Tweets, with Support for Emojis and Common Words

Bert Base Arabertv02 Twitter

Developed by aubmindlab

A BERT model optimized for Arabic dialects and tweets, pre-trained on 60 million Arabic tweets with MLM tasks, with added support for emojis and common vocabulary.

Large Language Model

Transformers

Arabic #Arabic Tweet Optimization #Multi-dialect Support #Emoji Enhancement

Downloads 2,148

Release Time: 3/2/2022

Model Overview

An Arabic pre-trained model based on Google's BERT architecture, specially optimized for handling Arabic dialects and social media texts.

Model Features

Tweet Optimization

Specially trained on 60 million multi-dialect Arabic tweets, optimized for social media text processing.

Extended Vocabulary

Added support for emojis and previously missing common vocabulary.

Short Text Optimization

Maximum sentence length set to 64 during pre-training, making it particularly suitable for short text processing.

Model Capabilities

Arabic Text Understanding

Social Media Text Analysis

Masked Word Prediction

Dialect Handling

Use Cases

Social Media Analysis

Arabic Tweet Sentiment Analysis

Analyze the sentiment tendencies of Arabic users' tweets.

Dialect Content Understanding

Process social media content in various Arabic dialects.

Text Completion

Arabic Text Auto-Completion

Predict masked Arabic vocabulary.

For example, accurately predicting 'Beirut' in 'The capital of Lebanon is [MASK]'.

🚀 AraBERTv0.2-Twitter

AraBERTv0.2-Twitter-base/large are two new models for Arabic dialects and tweets, trained by continuing the pre-training using the MLM task on ~60M Arabic tweets (filtered from a collection of 100M).

📦 Datasets

wikipedia
Osian
1.5B-Arabic-Corpus
oscar-arabic-unshuffled
Assafir(private)
Twitter(private)

🧩 Widget

text: " عاصمة لبنان هي [MASK] ."

✨ Features

AraBERTv0.2-Twitter-base/large are two new models designed for Arabic dialects and tweets. They are trained by continuing the pre-training using the MLM task on approximately 60 million Arabic tweets (filtered from a collection of 100 million). These two new models have added emojis to their vocabulary, along with common words that were not initially present. The pre-training was carried out with a maximum sentence length of 64 for only 1 epoch.

AraBERT is an Arabic pretrained language model based on Google's BERT architecture. It uses the same BERT-Base config. More details are available in the AraBERT Paper and in the AraBERT Meetup.

📋 Other Models

Model	HuggingFace Model Name	Size (MB/Params)	Pre-Segmentation	DataSet (Sentences/Size/nWords)
AraBERTv0.2-base	[bert-base-arabertv02](https://huggingface.co/aubmindlab/bert-base-arabertv02)	543MB / 136M	No	200M / 77GB / 8.6B
AraBERTv0.2-large	[bert-large-arabertv02](https://huggingface.co/aubmindlab/bert-large-arabertv02)	1.38G / 371M	No	200M / 77GB / 8.6B
AraBERTv2-base	[bert-base-arabertv2](https://huggingface.co/aubmindlab/bert-base-arabertv2)	543MB / 136M	Yes	200M / 77GB / 8.6B
AraBERTv2-large	[bert-large-arabertv2](https://huggingface.co/aubmindlab/bert-large-arabertv2)	1.38G / 371M	Yes	200M / 77GB / 8.6B
AraBERTv0.1-base	[bert-base-arabertv01](https://huggingface.co/aubmindlab/bert-base-arabertv01)	543MB / 136M	No	77M / 23GB / 2.7B
AraBERTv1-base	[bert-base-arabert](https://huggingface.co/aubmindlab/bert-base-arabert)	543MB / 136M	Yes	77M / 23GB / 2.7B
AraBERTv0.2-Twitter-base	[bert-base-arabertv02-twitter](https://huggingface.co/aubmindlab/bert-base-arabertv02-twitter)	543MB / 136M	No	Same as v02 + 60M Multi-Dialect Tweets
AraBERTv0.2-Twitter-large	[bert-large-arabertv02-twitter](https://huggingface.co/aubmindlab/bert-large-arabertv02-twitter)	1.38G / 371M	No	Same as v02 + 60M Multi-Dialect Tweets

💻 Usage Examples

Basic Usage

from arabert.preprocess import ArabertPreprocessor
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "aubmindlab/bert-base-arabertv02-twitter"
arabert_prep = ArabertPreprocessor(model_name=model_name)

text = "ولن نبالغ إذا قلنا إن هاتف أو كمبيوتر المكتب في زمننا هذا ضروري"
# preprocess() returns the cleaned string; keep the result to pass to the tokenizer
text_preprocessed = arabert_prep.preprocess(text)

# Load the tokenizer and the masked-language-model head from the same checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
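
Masked word prediction with this checkpoint can be sketched with the `fill-mask` pipeline from `transformers`. This is a minimal sketch, not the authors' code: the helper names `load_fill_mask` and `top_predictions` are hypothetical, and the import is done lazily because loading downloads the model on first use.

```python
def load_fill_mask(model_name="aubmindlab/bert-base-arabertv02-twitter"):
    """Build a fill-mask pipeline for the given checkpoint (hypothetical helper)."""
    from transformers import pipeline  # lazy import: heavy dependency

    return pipeline("fill-mask", model=model_name)


def top_predictions(fill_mask, text, k=3):
    """Return the k most likely (token, score) pairs for the single [MASK] in text.

    `fill_mask` is any callable with the pipeline's interface: it takes the text
    and a `top_k` argument and returns a list of dicts that contain at least
    the keys "token_str" and "score".
    """
    if text.count("[MASK]") != 1:
        raise ValueError("expected exactly one [MASK] token")
    results = fill_mask(text, top_k=k)
    return [(r["token_str"], round(r["score"], 4)) for r in results]
```

For instance, `top_predictions(load_fill_mask(), " عاصمة لبنان هي [MASK] .")` would query the model with the widget sentence above. The scores depend on the downloaded weights, so no output is shown here.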

📚 Documentation

Preprocessing

⚠️ Important Note

The model is trained on a sequence length of 64. Using a max length beyond 64 might result in degraded performance.
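
One way to act on this note is to check a dataset for inputs that would exceed 64 tokens before training or testing. The sketch below is an illustration under stated assumptions: `find_overlong` is a hypothetical helper, `tokenize` is any function that maps a string to a list of tokens, and the two extra positions account for the `[CLS]` and `[SEP]` tokens that BERT adds around a single sequence.

```python
def find_overlong(texts, tokenize, max_len=64, num_special=2):
    """Return (index, token_count) for every text whose encoded length exceeds max_len.

    token_count includes `num_special` positions for [CLS] and [SEP].
    """
    overlong = []
    for i, text in enumerate(texts):
        n_tokens = len(tokenize(text)) + num_special
        if n_tokens > max_len:
            overlong.append((i, n_tokens))
    return overlong
```

With the real tokenizer, `tokenizer.tokenize` can be passed as `tokenize`. Inputs flagged this way can be shortened, or truncated at encoding time by calling the tokenizer with `truncation=True` and `max_length=64`.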

It is recommended to apply our preprocessing function before training/testing on any dataset. The preprocessor will keep and space out emojis when used with a "twitter" model.
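
To illustrate what "spacing out" emojis means, here is a simplified sketch. It is not the `ArabertPreprocessor` implementation: the function name `space_out_emojis` and the Unicode ranges are assumptions chosen for the example, they cover only common emoji blocks, and multi-codepoint emoji sequences (for example those joined with zero-width joiners) would be split apart.

```python
import re

# Assumed ranges: common pictographs and emoticons (U+1F300 to U+1FAFF)
# plus miscellaneous symbols and dingbats (U+2600 to U+27BF).
_EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def space_out_emojis(text):
    """Surround each matched emoji with spaces, then collapse repeated whitespace."""
    spaced = _EMOJI.sub(lambda m: f" {m.group(0)} ", text)
    return " ".join(spaced.split())
```

Separating emojis from adjacent words lets each emoji be matched as its own vocabulary entry instead of being glued to a neighbouring token. For real use, prefer the library's own preprocessor as recommended above.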

📄 Citation

If you use this model, please cite us as shown below. Google Scholar has our BibTeX wrong (missing name), so use this entry instead:

@inproceedings{antoun2020arabert,
  title={AraBERT: Transformer-based Model for Arabic Language Understanding},
  author={Antoun, Wissam and Baly, Fady and Hajj, Hazem},
  booktitle={LREC 2020 Workshop Language Resources and Evaluation Conference 11--16 May 2020},
  pages={9}
}

🙏 Acknowledgments

Thanks to TensorFlow Research Cloud (TFRC) for the free access to Cloud TPUs. We couldn't have done it without this program. Thanks also to the AUB MIND Lab Members for their continuous support. Additionally, thanks to Yakshof and Assafir for data and storage access. Another thanks to Habib Rahal (https://www.behance.net/rahalhabib) for putting a face to AraBERT.

📞 Contacts

Wissam Antoun: Linkedin | Twitter | Github | wfa07@mail.aub.edu | wissam.antoun@gmail.com

Fady Baly: Linkedin | Twitter | Github | fgb06@mail.aub.edu | baly.fady@gmail.com
