MARBERTv2 Arabic Written Dialect Classifier: a free, open-source model for identifying five major written Arabic dialects

Developed by IbrahimAmin

An Arabic dialect classifier fine-tuned from MARBERTv2 that identifies five major written Arabic dialects.

License: Apache-2.0
Tags: Arabic dialect identification, multi-dialect classification, social media text analysis

Release Time: 5/7/2025

Model Overview

This model is used for Arabic written dialect classification and can identify Modern Standard Arabic (MSA) and 4 regional Arabic dialects (Maghrebi, Levantine, Gulf, and Egyptian dialects) from raw text.

Model Features

Multi-dialect identification

Able to distinguish five major Arabic written dialect regions, including Maghrebi, Levantine, Modern Standard Arabic, Gulf, and Egyptian dialects

Large-scale training data

Trained on roughly 850,000 Arabic sentences from 9 different public datasets

Social media optimization

Particularly suitable for dialect identification of short Arabic text fragments, with data sources including social media, forums, and informal writing

Model Capabilities

Arabic dialect classification

Text analysis

Social media content identification

Use Cases

Language research

Dialect distribution research

Analyze the geographical distribution of different Arabic dialects on social media

Natural language processing

Dialect-aware system

Provide customized NLP services for users in different dialect regions

🚀 MARBERTv2 Arabic Written Dialect Classifier

This model is a fine-tuned version of the pre-trained MARBERTv2 model for Arabic written dialect classification. It identifies Modern Standard Arabic (MSA) and 4 regional Arabic dialects from raw text, which makes it useful for dialect identification, linguistic research, and dialect-aware natural language processing systems.

📚 Documentation

✨ Features

This model is a fine-tuned version of UBC-NLP/MARBERTv2 for Arabic written dialect classification.
It can distinguish among five major written Arabic dialect regions: MAGHREB, LEV, MSA, GLF, and EGY.
It is intended for dialect identification in short Arabic text snippets from various sources including social media, forums, and informal writing.

🔧 Technical Details

Base Model: This model is fine-tuned from MARBERTv2, a transformer-based language model optimized for Arabic, on a multi-dialect classification task.
Labels (id2label): The model predicts one of the following five classes:

{
  "0": "MAGHREB", // Maghreb dialect (Northwest Africa: Morocco, Algeria, Tunisia, etc.)
  "1": "LEV",     // Levantine dialect (Lebanon, Syria, Jordan, Palestine)
  "2": "MSA",     // Modern Standard Arabic
  "3": "GLF",     // Gulf dialect (Saudi Arabia, UAE, Kuwait, etc.)
  "4": "EGY"      // Egyptian dialect
}

Training Data: The model was trained on roughly 850,000 Arabic sentences from 9 publicly available datasets, covering a wide variety of written Arabic dialects. The distribution by dialect is as follows:

| Dialect | Count |
|-----------|----------|
| GLF | 253,553 |
| LEV | 243,025 |
| MAGHREB | 140,887 |
| EGY | 105,226 |
| MSA | 83,231 |

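The per-dialect counts listed above add up to 825,922 sentences and are noticeably imbalanced. As a quick check, the sketch below recomputes each dialect's share of the training data; the names `TRAIN_COUNTS` and `class_shares` are illustrative and not part of the model's code.

```python
# Sentence counts per dialect, copied from the distribution listed above.
TRAIN_COUNTS = {
    "GLF": 253_553,
    "LEV": 243_025,
    "MAGHREB": 140_887,
    "EGY": 105_226,
    "MSA": 83_231,
}


def class_shares(counts):
    """Return each label's fraction of the total (illustrative helper)."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


total = sum(TRAIN_COUNTS.values())
print(f"Total sentences: {total:,}")
for label, share in class_shares(TRAIN_COUNTS).items():
    # Print each label with its share formatted as a percentage.
    print(f"{label:8s} {share:6.1%}")
```

GLF and LEV together make up about 60% of the sentences while MSA is about 10%, so per-class metrics may be more informative than overall accuracy when evaluating the classifier.
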
Training Details:
- Architecture: MARBERTv2 (BERT-based)
- Task: Text Classification (Dialect Identification)
- Objective: Multi-class classification with softmax over 5 dialect classes
- Tokenizer: UBC-NLP/MARBERTv2
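
Because the classification head produces one logit per class, a softmax turns those logits into a probability for each dialect. The sketch below shows the step in plain Python; `label_probabilities` is a hypothetical helper, and the example logits are made up for illustration rather than taken from the model.

```python
import math
from typing import Dict, Mapping, Sequence

# Label mapping as listed under "Labels (id2label)".
ID2LABEL = {0: "MAGHREB", 1: "LEV", 2: "MSA", 3: "GLF", 4: "EGY"}


def label_probabilities(
    logits: Sequence[float],
    id2label: Mapping[int, str] = ID2LABEL,
) -> Dict[str, float]:
    """Apply a numerically stable softmax and key the result by label name."""
    peak = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return {id2label[i]: e / total for i, e in enumerate(exps)}


# Made-up logits for a single sentence (illustrative only).
example_logits = [-1.2, 0.3, -0.5, 0.1, 4.0]
probs = label_probabilities(example_logits)
best = max(probs, key=probs.get)
print(best, round(probs[best], 3))
```

With the loaded model, `logits[0].tolist()` and `model.config.id2label` could be passed to the helper in place of the made-up values.
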
Datasets Used:

| Dataset | Brief Description | Annotation strategy | Provided Labels | Current SOTA Performance |
| :---: | :---: | :---: | :---: | :---: |
| MADAR Subtask-1 (MADAR-6) | A collection of parallel sentences (BTEC) covering the dialects of 5 cities from the Arab world and MSA in the travel domain (10,000 sentences per city) | Manual | 5 Arab cities + MSA | 92.5% accuracy |
| MADAR Subtask-1 (MADAR-26) | A collection of parallel sentences (BTEC) covering the dialects of 25 cities from the Arab world and MSA in the travel domain (2,000 sentences per city) | Manual | 25 Arab cities + MSA | 67.32% F1-score |
| DART | 25K tweets annotated via crowdsourcing and well balanced over five main groups of Arabic dialects | Manual | 5 Arab regions | UNK |
| ArSarcasm v1 | 10,547 tweets from the ASTD and SemEval datasets for sarcasm detection, with dialect information added | Manual | 4 Arab regions + MSA | UNK |
| ArSarcasm v2 | 15,548 tweets; an extension of the original ArSarcasm dataset (ArSarcasm v1 plus portions of the DAICT corpus and some new tweets) | Manual | 4 Arab regions + MSA | UNK |
| IADD | Built from five publicly available corpora that were identified, analyzed, and filtered (AOC, DART, PADIC, SHAMI, and TSAC) | ________ | 5 regions and 9 countries | UNK |
| QADI | 540k tweets (30k per country on average) with a total of 8.8M words | Automatic | 18 Arab countries | 60.6% |
| AOC | The Arabic Online Commentary dataset, based on reader commentary from the online versions of three Arabic newspapers: AlGhad (JOR), Al-Riyadh (KSA), and Al-Youm Al-Sabe’ (EGY) | Manual | 3 Arab regions + MSA | UNK |
| NADI-2020 | 25,957 tweets from 100 Arab provinces and 21 Arab countries | Automatic | 100 provinces and 21 countries | 6.39% to 26.78% |

💻 Usage Examples

Basic Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "IbrahimAmin/marbertv2-arabic-written-dialect-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "الدنيا مش مستاهلة تجري كده، خد وقتك واستمتع بالحاجة البسيطة"
inputs = tokenizer(text, return_tensors="pt")

# Run inference
with torch.inference_mode():
    logits = model(**inputs).logits

pred = torch.argmax(logits, dim=-1).item()

print(f"Predicted Dialect: {model.config.id2label[pred]}")
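
To classify many texts at once, the same calls can be batched and the predictions tallied. The following is a minimal sketch rather than the author's code: `classify_batch` and `dialect_shares` are hypothetical helper names, and the `batch_size` and `max_length` values are assumptions chosen with short snippets in mind.

```python
from collections import Counter
from typing import Dict, List, Sequence


def classify_batch(
    texts: Sequence[str],
    model_name: str = "IbrahimAmin/marbertv2-arabic-written-dialect-classifier",
    batch_size: int = 32,   # assumed value; tune to available memory
    max_length: int = 128,  # assumed cap for short snippets
) -> List[str]:
    """Return one predicted dialect label per input text."""
    # Imported lazily so the tallying helper below works without these packages.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()

    labels: List[str] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        # Pad to the longest text in the batch and truncate overly long inputs.
        inputs = tokenizer(
            batch,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=max_length,
        )
        with torch.inference_mode():
            logits = model(**inputs).logits
        for idx in logits.argmax(dim=-1).tolist():
            labels.append(model.config.id2label[idx])
    return labels


def dialect_shares(labels: Sequence[str]) -> Dict[str, float]:
    """Return the fraction of predictions per label, most frequent first."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.most_common()}
```

For example, `dialect_shares(classify_batch(my_texts))` would summarize the predicted dialect mix of a hypothetical list `my_texts`, which is one way to approach dialect distribution studies on a corpus.
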

📄 License

This model is released under the Apache-2.0 license.

🚀 Acknowledgements

MARBERTv2 team at UBC-NLP
Contributors of the Arabic dialect datasets used in training

📚 Citation

If you use this model in your research or application, please cite:

@misc{ibrahimamin_marbertv2_arabic_written_dialect_classifier,
  author = {Ibrahim Amin},
  title = {MARBERTv2 Arabic Written Dialect Classifier},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/IbrahimAmin/marbertv2-arabic-written-dialect-classifier}},
}
