🚀 AraBERTv0.2-Twitter
AraBERTv0.2-Twitter is designed to handle Arabic dialects and tweets effectively. It continues the pre-training of AraBERTv0.2 with the MLM task on approximately 60M Arabic tweets (filtered from a collection of 100M). The vocabulary of the two new Twitter models includes emojis and common words that were not initially present. Pre-training was conducted with a maximum sentence length of 64 for only one epoch.
AraBERT is an Arabic pre-trained language model based on Google's BERT architecture. It uses the same BERT-Base configuration. More details can be found in the AraBERT Paper and the AraBERT Meetup.
📦 Datasets
- wikipedia
- Osian
- 1.5B-Arabic-Corpus
- oscar-arabic-unshuffled
- Assafir (private)
- Twitter (private)
🎛️ Widget
- text: " عاصمة لبنان هي [MASK] ."
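The widget example above can be reproduced locally with the `transformers` fill-mask pipeline. The snippet below is a minimal sketch of typical usage; the pipeline call itself is an assumption, not part of this card:

```python
from transformers import pipeline

# Load the Twitter model into a fill-mask pipeline
fill_mask = pipeline("fill-mask", model="aubmindlab/bert-base-arabertv02-twitter")

# Ask the model to complete the widget example ("The capital of Lebanon is [MASK].")
for prediction in fill_mask("عاصمة لبنان هي [MASK] ."):
    print(prediction["token_str"], round(prediction["score"], 3))
```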
✨ Features
- Vocabulary Expansion: The models have emojis and additional common words in their vocabulary.
- Dialect and Tweet Focus: Specifically trained for Arabic dialects and tweets.
📋 Other Models
| Model | HuggingFace Model Name | Size (MB/Params) | Pre-Segmentation | DataSet (Sentences/Size/nWords) |
|---|---|---|---|---|
| AraBERTv0.2-base | [bert-base-arabertv02](https://huggingface.co/aubmindlab/bert-base-arabertv02) | 543MB / 136M | No | 200M / 77GB / 8.6B |
| AraBERTv0.2-large | [bert-large-arabertv02](https://huggingface.co/aubmindlab/bert-large-arabertv02) | 1.38G / 371M | No | 200M / 77GB / 8.6B |
| AraBERTv2-base | [bert-base-arabertv2](https://huggingface.co/aubmindlab/bert-base-arabertv2) | 543MB / 136M | Yes | 200M / 77GB / 8.6B |
| AraBERTv2-large | [bert-large-arabertv2](https://huggingface.co/aubmindlab/bert-large-arabertv2) | 1.38G / 371M | Yes | 200M / 77GB / 8.6B |
| AraBERTv0.1-base | [bert-base-arabertv01](https://huggingface.co/aubmindlab/bert-base-arabertv01) | 543MB / 136M | No | 77M / 23GB / 2.7B |
| AraBERTv1-base | [bert-base-arabert](https://huggingface.co/aubmindlab/bert-base-arabert) | 543MB / 136M | Yes | 77M / 23GB / 2.7B |
| AraBERTv0.2-Twitter-base | [bert-base-arabertv02-twitter](https://huggingface.co/aubmindlab/bert-base-arabertv02-twitter) | 543MB / 136M | No | Same as v02 + 60M Multi-Dialect Tweets |
| AraBERTv0.2-Twitter-large | [bert-large-arabertv02-twitter](https://huggingface.co/aubmindlab/bert-large-arabertv02-twitter) | 1.38G / 371M | No | Same as v02 + 60M Multi-Dialect Tweets |
💻 Usage Examples
Basic Usage
```python
from arabert.preprocess import ArabertPreprocessor
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "aubmindlab/bert-base-arabertv02-twitter"

# Apply the AraBERT preprocessor before feeding text to the model
arabert_prep = ArabertPreprocessor(model_name=model_name)

text = "ولن نبالغ إذا قلنا إن هاتف أو كمبيوتر المكتب في زمننا هذا ضروري"
text_preprocessed = arabert_prep.preprocess(text)

# Load the tokenizer and masked-LM model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
```
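Continuing from the snippet above, the following sketch runs the preprocessed text through the loaded model; the forward pass and shape check are illustrative assumptions, not part of the original example:

```python
import torch

# Tokenize the preprocessed text and run a forward pass through the masked-LM head
inputs = tokenizer(text_preprocessed, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Logits over the vocabulary for every token position
print(outputs.logits.shape)  # (batch_size, sequence_length, vocab_size)
```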
📚 Documentation
Preprocessing
⚠️ Important Note
The model was trained with a maximum sequence length of 64, so using a longer max length may result in degraded performance.
It is recommended to apply the preprocessing function before training or testing on any dataset. When used with a "twitter" model, the preprocessor keeps emojis and spaces them out.
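As a hedged illustration of the two points above, the sketch below applies the preprocessor before tokenization and caps sequences at the training length of 64; the sample tweet list is hypothetical:

```python
from arabert.preprocess import ArabertPreprocessor
from transformers import AutoTokenizer

model_name = "aubmindlab/bert-base-arabertv02-twitter"
arabert_prep = ArabertPreprocessor(model_name=model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Hypothetical raw tweets (illustrative only); the preprocessor keeps and spaces out emojis
raw_tweets = ["ولن نبالغ إذا قلنا إن هاتف أو كمبيوتر المكتب في زمننا هذا ضروري 😊"]
cleaned = [arabert_prep.preprocess(t) for t in raw_tweets]

# Tokenize with the sequence length the model was trained on
encodings = tokenizer(cleaned, truncation=True, max_length=64, padding="max_length", return_tensors="pt")
print(encodings["input_ids"].shape)  # (num_examples, 64)
```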
📄 Citation
If you use this model, please cite us as follows:
Google Scholar has our BibTeX entry wrong (a name is missing), so please use this one instead:
```bibtex
@inproceedings{antoun2020arabert,
  title={AraBERT: Transformer-based Model for Arabic Language Understanding},
  author={Antoun, Wissam and Baly, Fady and Hajj, Hazem},
  booktitle={LREC 2020 Workshop Language Resources and Evaluation Conference 11--16 May 2020},
  pages={9}
}
```
🙏 Acknowledgments
Thanks to TensorFlow Research Cloud (TFRC) for free access to Cloud TPUs; this project could not have been done without this program. Thanks to the AUB MIND Lab members for their continuous support. Thanks also to Yakshof and Assafir for data and storage access, and to Habib Rahal (https://www.behance.net/rahalhabib) for putting a face to AraBERT.
📞 Contacts