bert-base-arabic-camelbert-msa-did-madar-twitter5 Open-source Model - Supports Classification of 21 Arabic Dialects

Bert Base Arabic Camelbert Msa Did Madar Twitter5

Developed by CAMeL-Lab

An Arabic dialect identification model fine-tuned from CAMeLBERT-MSA that classifies text into 21 dialect labels

Language: Arabic
License: Apache-2.0
Tags: Arabic Dialect Identification, Social Media Text Analysis, Multi-dialect Classification

Downloads 90

Release Time: 3/2/2022

Model Overview

This model was built by fine-tuning CAMeLBERT-MSA for Arabic dialect identification. Fine-tuned on the MADAR Twitter-5 dataset, it distinguishes 21 dialect labels.

Model Features

Multi-dialect Support

Can identify 21 Arabic dialect variants, including Egyptian, Kuwaiti, and other regional dialects

Domain Optimization

Fine-tuned on Twitter data, so it is suited to the informal Arabic found in social media text

Academic Validation

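The fine-tuning procedure and results are described in a paper published by the Association for Computational Linguistics (Proceedings of the Sixth Arabic Natural Language Processing Workshop, 2021)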

Model Capabilities

Arabic Dialect Classification

Social Media Text Analysis

Multi-dialect Variant Recognition

Use Cases

Social Media Analysis

Twitter User Geolocation Analysis

Infer potential geographical origins of users based on dialect features in their posts

Distinguishes 21 dialect labels; accuracy varies from one dialect to another
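One way to turn per-post predictions into a user-level estimate is to sum the scores for each label across a user's posts and take the label with the largest total. The sketch below assumes predictions in the `{'label': ..., 'score': ...}` format that the transformers pipeline returns. The function name `aggregate_user_dialect` and the sample scores are hypothetical, and summing scores is an illustrative choice, not a method described by the model authors.

```python
from collections import defaultdict


def aggregate_user_dialect(predictions):
    """Sum per-post scores by label; return (top_label, share_of_total) or None."""
    totals = defaultdict(float)
    for pred in predictions:
        totals[pred["label"]] += pred["score"]
    if not totals:
        return None
    grand_total = sum(totals.values())
    best = max(totals, key=totals.get)
    return best, totals[best] / grand_total


# Hypothetical per-post outputs for a single user (not real model results).
user_predictions = [
    {"label": "Egypt", "score": 0.57},
    {"label": "Egypt", "score": 0.61},
    {"label": "Kuwait", "score": 0.52},
]
result = aggregate_user_dialect(user_predictions)
print(result)
```

The returned share is only a rough indicator of how consistently one label dominates a user's posts. A dialect label suggests a likely region, not a confirmed location.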

Linguistic Research

Dialect Distribution Research

Analyze the frequency and distribution characteristics of different dialects in specific topics
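A dialect distribution for a set of posts can be estimated by counting the predicted labels and normalizing the counts. The following is a minimal sketch that again assumes pipeline-style prediction dictionaries. `dialect_distribution` and the sample data are hypothetical names and values introduced for illustration.

```python
from collections import Counter


def dialect_distribution(predictions):
    """Return the relative frequency of each predicted label."""
    counts = Counter(pred["label"] for pred in predictions)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


# Hypothetical predictions for posts collected on one topic.
topic_predictions = [
    {"label": "Egypt", "score": 0.71},
    {"label": "Egypt", "score": 0.64},
    {"label": "Kuwait", "score": 0.58},
    {"label": "Egypt", "score": 0.55},
]
distribution = dialect_distribution(topic_predictions)
print(distribution)
```

Because the counts come from model predictions, the resulting shares inherit the model's per-dialect error rates and should be read as estimates.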

🚀 CAMeLBERT-MSA DID MADAR Twitter-5 Model

This is a dialect identification (DID) model fine-tuned from CAMeLBERT-MSA, offering high-quality dialect identification for Arabic text.

🚀 Quick Start

You can use the CAMeLBERT-MSA DID MADAR Twitter-5 model as part of the transformers pipeline. This model will also be available in CAMeL Tools soon.

💻 Usage Examples

Basic Usage

>>> from transformers import pipeline
>>> did = pipeline('text-classification', model='CAMeL-Lab/bert-base-arabic-camelbert-msa-did-madar-twitter5')
>>> sentences = ['عامل ايه ؟', 'شلونك ؟ شخبارك ؟']
>>> did(sentences)
[{'label': 'Egypt', 'score': 0.5741344094276428},
 {'label': 'Kuwait', 'score': 0.5225679278373718}]

Note: to download our models, you would need transformers>=3.5.0. Otherwise, you could download the models manually.
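The scores in the example above are moderate (about 0.57 and 0.52), so downstream code may want to treat low-confidence predictions separately. The sketch below pairs each input sentence with its prediction and marks anything under a cutoff as uncertain. It works on the output shown above without reloading the model. The helper `label_with_threshold`, the `"uncertain"` marker, and the 0.55 cutoff are assumptions made for illustration, not part of the model or the transformers API.

```python
def label_with_threshold(sentences, predictions, threshold=0.5):
    """Pair each sentence with its label, or 'uncertain' if the score is below threshold."""
    results = []
    for sentence, pred in zip(sentences, predictions):
        label = pred["label"] if pred["score"] >= threshold else "uncertain"
        results.append((sentence, label, pred["score"]))
    return results


# Inputs and outputs copied from the usage example above.
sentences = ['عامل ايه ؟', 'شلونك ؟ شخبارك ؟']
predictions = [
    {'label': 'Egypt', 'score': 0.5741344094276428},
    {'label': 'Kuwait', 'score': 0.5225679278373718},
]

# 0.55 is an arbitrary illustrative cutoff, not a recommended value.
labeled = label_with_threshold(sentences, predictions, threshold=0.55)
for sentence, label, score in labeled:
    print(sentence, label, round(score, 3))
```

A suitable cutoff depends on the application and would need to be chosen on held-out data.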

✨ Features

CAMeLBERT-MSA DID MADAR Twitter-5 Model is a dialect identification (DID) model. It was built by fine-tuning the [CAMeLBERT-MSA](https://huggingface.co/CAMeL-Lab/bert-base-arabic-camelbert-msa/) model. For the fine-tuning, the [MADAR Twitter-5](https://camel.abudhabi.nyu.edu/madar-shared-task-2019/) dataset, which includes 21 labels, was used.

📚 Documentation

Our fine-tuning procedure and the hyperparameters we used can be found in our paper "The Interplay of Variant, Size, and Task Type in Arabic Pre-trained Language Models." Our fine-tuning code can be found [here](https://github.com/CAMeL-Lab/CAMeLBERT).

📄 License

This model is licensed under the Apache-2.0 license.

📚 Citation

@inproceedings{inoue-etal-2021-interplay,
    title = "The Interplay of Variant, Size, and Task Type in {A}rabic Pre-trained Language Models",
    author = "Inoue, Go  and
      Alhafni, Bashar  and
      Baimukan, Nurpeiis  and
      Bouamor, Houda  and
      Habash, Nizar",
    booktitle = "Proceedings of the Sixth Arabic Natural Language Processing Workshop",
    month = apr,
    year = "2021",
    address = "Kyiv, Ukraine (Online)",
    publisher = "Association for Computational Linguistics",
    abstract = "In this paper, we explore the effects of language variants, data sizes, and fine-tuning task types in Arabic pre-trained language models. To do so, we build three pre-trained language models across three variants of Arabic: Modern Standard Arabic (MSA), dialectal Arabic, and classical Arabic, in addition to a fourth language model which is pre-trained on a mix of the three. We also examine the importance of pre-training data size by building additional models that are pre-trained on a scaled-down set of the MSA variant. We compare our different models to each other, as well as to eight publicly available models by fine-tuning them on five NLP tasks spanning 12 datasets. Our results suggest that the variant proximity of pre-training data to fine-tuning data is more important than the pre-training data size. We exploit this insight in defining an optimized system selection model for the studied tasks.",
}

Property	Details
Model Type	Dialect identification (DID) model
Training Data	MADAR Twitter-5 dataset with 21 labels

