🚀 MARBERTv2 Arabic Written Dialect Classifier
This model is a fine-tuned version of the pre-trained UBC-NLP/MARBERTv2 model for Arabic written dialect classification. It identifies Modern Standard Arabic (MSA) and four regional Arabic dialects from raw text, which makes it useful for dialect identification, linguistic research, and dialect-aware natural language processing systems.
📚 Documentation
✨ Features
- This model is a fine-tuned version of UBC-NLP/MARBERTv2 for Arabic written dialect classification.
- It can distinguish among five major written Arabic dialect regions: MAGHREB, LEV, MSA, GLF, and EGY.
- It is intended for dialect identification in short Arabic text snippets from various sources including social media, forums, and informal writing.
🔧 Technical Details
- Base Model: This model is fine-tuned from MARBERTv2, a transformer-based language model optimized for Arabic, on a multi-dialect classification task.
- Labels (id2label): The model predicts one of the following five classes:
{
"0": "MAGHREB",
"1": "LEV",
"2": "MSA",
"3": "GLF",
"4": "EGY"
}
- Training Data: The model was trained on roughly 850,000 Arabic sentences drawn from 9 publicly available datasets that cover a wide variety of written Arabic dialects. The distribution by dialect is as follows (the listed counts sum to 825,922):
| Dialect | Count |
|-----------|----------|
| GLF | 253,553 |
| LEV | 243,025 |
| MAGHREB | 140,887 |
| EGY | 105,226 |
| MSA | 83,231 |
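The counts above show that the classes are not evenly represented: GLF and LEV together account for well over half of the sentences, while MSA is the smallest class. As a minimal sketch, the snippet below computes each dialect's share of the data and a set of inverse-frequency class weights. The weighting scheme is an assumption added for illustration; the model card does not say whether any class weighting or resampling was used during training.

```python
# Per-dialect sentence counts copied from the distribution table.
counts = {
    "GLF": 253_553,
    "LEV": 243_025,
    "MAGHREB": 140_887,
    "EGY": 105_226,
    "MSA": 83_231,
}

total = sum(counts.values())

# Fraction of the training data contributed by each dialect.
shares = {label: n / total for label, n in counts.items()}

# Inverse-frequency weights, scaled so that a perfectly balanced dataset
# would give every class a weight of 1.0.
# ASSUMPTION: illustrative only, not the documented training setup.
weights = {label: total / (len(counts) * n) for label, n in counts.items()}

print(f"total sentences: {total}")
for label in counts:
    print(f"{label:8s} share={shares[label]:.3f} weight={weights[label]:.2f}")
```

Weights of this kind could be passed to a weighted cross-entropy loss if someone wanted to fine-tune a similar classifier and compensate for the imbalance.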
- Training Details:
  - Architecture: MARBERTv2 (BERT-based)
  - Task: Text Classification (Dialect Identification)
  - Objective: Multi-class classification with softmax over 5 dialect classes
  - Tokenizer: UBC-NLP/MARBERTv2
- Datasets Used:
| Dataset | Brief Description | Annotation Strategy | Provided Labels | Current SOTA Performance |
| :---: | :---: | :---: | :---: | :---: |
| MADAR Subtask-1 (MADAR-6) | A collection of parallel sentences (BTEC) covering the dialects of 5 cities from the Arab world and MSA in the travel domain (10,000 sentences per city) | Manual | 5 Arab Cities + MSA | 92.5% Accuracy |
| MADAR Subtask-1 (MADAR-26) | A collection of parallel sentences (BTEC) covering the dialects of 25 cities from the Arab world and MSA in the travel domain (2,000 sentences per city) | Manual | 25 Arab Cities + MSA | 67.32% F1-Score |
| DART | 25K tweets annotated via crowdsourcing and well balanced over five main groups of Arabic dialects | Manual | 5 Arab Regions | UNK |
| ArSarcasm v1 | 10,547 tweets from the ASTD and SemEval datasets for sarcasm detection, with dialect information added | Manual | 4 Arab Regions + MSA | UNK |
| ArSarcasm v2 | 15,548 tweets; an extension of the original ArSarcasm dataset (ArSarcasm v1 plus portions of the DAICT corpus and some new tweets) | Manual | 4 Arab Regions + MSA | UNK |
| IADD | Built from five publicly available corpora that were identified, analyzed, and filtered (AOC, DART, PADIC, SHAMI, and TSAC) | ________ | 5 Regions and 9 Countries | UNK |
| QADI | 540k tweets (30k per country on average) with a total of 8.8M words | Automatic | 18 Arab Countries | 60.6% |
| AOC | The Arabic Online Commentary dataset, based on reader commentary from the online versions of three Arabic newspapers: AlGhad from JOR, Al-Riyadh from KSA, and Al-Youm Al-Sabe’ from EGY | Manual | 3 Arab Regions + MSA | UNK |
| NADI-2020 | 25,957 tweets from 100 Arab provinces and 21 Arab countries | Automatic | 100 Prov. and 21 Coun. | 6.39% - 26.78% |
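The datasets above provide labels at different granularities (cities, provinces, countries, or regions), while the classifier predicts only five classes. The model card does not document how the source labels were merged, so the snippet below is only a hypothetical sketch of such a harmonization step. The `COUNTRY_TO_REGION` table, `to_region`, and `harmonize` are illustrative names, and the mapping covers only the three country codes mentioned in the AOC row.

```python
# ASSUMPTION: hypothetical mapping from country-level source labels to the
# model's five classes. Only the codes named in the AOC row are included;
# a real pipeline would need an entry for every label in every dataset.
COUNTRY_TO_REGION = {
    "EGY": "EGY",  # Egypt -> Egyptian
    "JOR": "LEV",  # Jordan -> Levantine
    "KSA": "GLF",  # Saudi Arabia -> Gulf
}


def to_region(country_code, default=None):
    """Map a country code to a region label, or return `default` if unknown."""
    return COUNTRY_TO_REGION.get(country_code.upper(), default)


def harmonize(records):
    """Keep (text, region) pairs whose source label can be mapped."""
    harmonized = []
    for text, country_code in records:
        region = to_region(country_code)
        if region is not None:
            harmonized.append((text, region))
    return harmonized


# Placeholder records; "XYZ" stands in for a label with no known mapping.
sample = [("sentence 1", "EGY"), ("sentence 2", "jor"), ("sentence 3", "XYZ")]
print(harmonize(sample))
```

Dropping unmapped records, as done here, is one possible design choice; routing them to manual review would be another.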
💻 Usage Examples
Basic Usage
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the tokenizer and the fine-tuned classifier from the Hugging Face Hub.
model_name = "IbrahimAmin/marbertv2-arabic-written-dialect-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example input sentence.
text = "الدنيا مش مستاهلة تجري كده، خد وقتك واستمتع بالحاجة البسيطة"
inputs = tokenizer(text, return_tensors="pt")

# Run a forward pass without tracking gradients.
with torch.inference_mode():
    logits = model(**inputs).logits

# Pick the highest-scoring class and map its id to a dialect label.
pred = torch.argmax(logits, dim=-1).item()
print(f"Predicted Dialect: {model.config.id2label[pred]}")
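The basic example reports only the top label. Since the objective is a softmax over five classes, it can also be useful to see how probability is spread across all dialects. The following is a minimal sketch in plain Python that takes one row of logits (for example, from `logits[0].tolist()`) and returns ranked probabilities. The helper names `softmax` and `rank_dialects` and the example logit values are made up for illustration.

```python
import math

# Label mapping copied from the id2label section of this model card.
ID2LABEL = {0: "MAGHREB", 1: "LEV", 2: "MSA", 3: "GLF", 4: "EGY"}


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rank_dialects(logits):
    """Return (label, probability) pairs sorted from most to least likely."""
    probs = softmax(logits)
    pairs = [(ID2LABEL[i], p) for i, p in enumerate(probs)]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# ASSUMPTION: hypothetical logits for a single sentence, not real model output.
example_logits = [-1.2, 0.3, -0.5, 0.1, 4.0]
for label, prob in rank_dialects(example_logits):
    print(f"{label}: {prob:.3f}")
```

Inspecting the full ranking can help flag ambiguous inputs, for instance when the top two probabilities are close to each other.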
📄 License
This model is released under the Apache-2.0 license.
🚀 Acknowledgements
- MARBERTv2 team at UBC-NLP
- Contributors of the Arabic dialect datasets used in training
📚 Citation
If you use this model in your research or application, please cite:
@misc{ibrahimamin_marbertv2_arabic_written_dialect_classifier,
author = {Ibrahim Amin},
title = {MARBERTv2 Arabic Written Dialect Classifier},
year = {2025},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/IbrahimAmin/marbertv2-arabic-written-dialect-classifier}},
}