AraModernBert-Topic-Classifier Open-Source Model - A Practical Tool for Arabic Topic Classification Tasks

AraModernBert Topic Classifier

Developed by NAMAA-Space

This is an experimental Arabic topic classification model based on ModernBERT-base, demonstrating how to adapt ModernBERT for Arabic language tasks.

Text Classification

Transformers

Arabic

Open Source License: Apache-2.0

#Arabic text classification #ModernBERT adaptation #High F1 score

Release Time: 1/10/2025

Model Overview

This model is trained for Arabic text topic classification tasks, utilizing the base architecture of ModernBERT and a custom Arabic tokenizer.

Model Features

Arabic adaptation

Demonstrates how to adapt ModernBERT for Arabic language tasks

Efficient training

Achieved an F1 score of 0.95 with only 3 training epochs

Specialized topic classification

Focuses on classifying Arabic texts into 7 distinct topics

Model Capabilities

Arabic text classification

Multi-topic recognition

Efficient inference

Use Cases

Content classification

News classification

Classify Arabic news articles by topic

Achieved 0.95 F1 score on test set

Social media analysis

Analyze topic distribution in Arabic social media content

🚀 AraModernBert For Topic Classification

This is an experimental Arabic model that showcases how ModernBERT can be adapted for Arabic in tasks such as topic classification.

🚀 Quick Start

The model can be used for text classification using the transformers library. Below is an example:

from transformers import pipeline

# Load model from huggingface.co/models using our repository ID
classifier = pipeline(
    task="text-classification",
    model="Omartificial-Intelligence-Space/AraModernBert-Topic-Classifier",
)

sample = '''
PUT SOME TEXT HERE TO CLASSIFY ITS TOPIC
'''
result = classifier(sample)
print(result)

# Example output:
# [{'label': 'health', 'score': 0.6779336333274841}]
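To classify several texts in one call, the pipeline can also be given a list of strings. The following is a minimal sketch, assuming each result has the `{'label', 'score'}` shape shown in the example output above. `classify_batch` is a hypothetical helper name, not part of the model repository or the transformers library.

```python
from typing import Callable, List, Tuple


def classify_batch(classifier: Callable, texts: List[str]) -> List[Tuple[str, float]]:
    """Return one (label, score) pair per input text.

    `classifier` is expected to behave like a transformers
    text-classification pipeline called on a list of strings.
    """
    results = classifier(texts)
    pairs = []
    for item in results:
        # Some pipeline configurations return a list of dicts per input
        # (one per label); in that case keep the highest-scoring entry.
        if isinstance(item, list):
            item = max(item, key=lambda d: d["score"])
        pairs.append((item["label"], float(item["score"])))
    return pairs
```

With the `classifier` object created in the Quick Start, `classify_batch(classifier, [text_a, text_b])` would return a list such as `[(label_a, score_a), (label_b, score_b)]`, where the label strings are whatever the model's configuration defines.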

✨ Features

This is an experimental Arabic version of ModernBERT-base, trained only on a topic classification task, using the original ModernBERT base model with a custom tokenizer trained on Arabic. The details are as follows:

Dataset: Arabic Wikipedia
Size: 1.8 GB
Tokens: 228,788,529 tokens

This model demonstrates how ModernBERT can be adapted to Arabic for tasks like topic classification.

📚 Documentation

Model Evaluation Details

Epochs: 3
Evaluation Metrics:
- F1 Score: 0.95
- Loss: 0.1998
Training Step: 47,862
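For readers who want to reproduce an F1 figure on their own labeled data, the sketch below computes per-class and macro-averaged F1 in plain Python. The source does not say whether the reported 0.95 is macro, micro, or weighted, so macro averaging here is an assumption made only for illustration. The function names `per_class_f1` and `macro_f1` are hypothetical.

```python
from typing import Dict, List


def per_class_f1(y_true: List[str], y_pred: List[str]) -> Dict[str, float]:
    """F1 for each label, computed as 2*TP / (2*TP + FP + FN)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = {}
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Unweighted mean of the per-class F1 scores (0.0 for empty input)."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores) if scores else 0.0
```

Here `y_true` holds the gold topic of each test example and `y_pred` the label returned by the classifier for the same example, in the same order.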

Dataset Used For Training

The SANAD dataset was used for training and testing. It contains 7 topics: Politics, Finance, Medical, Culture, Sport, Tech, and Religion.

Test Phase Results

The model was evaluated on a test set of 14,181 examples covering the different topics. The distribution of these topics is shown in the following figure:

[Figure: topic distribution of the test set]

The model achieved the following prediction accuracy on this test set:

[Figure: prediction accuracy on the test set]
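The two summaries described above, the topic distribution of a test set and the accuracy of predictions on it, could be computed from gold and predicted labels along the following lines. This is a sketch under stated assumptions and not the authors' evaluation code: the function names are hypothetical, and the topic strings in the usage note are placeholders taken from the dataset description and may differ from the labels the model emits.

```python
from collections import Counter
from typing import Dict, List


def topic_distribution(labels: List[str]) -> Dict[str, int]:
    """Number of examples per topic."""
    return dict(Counter(labels))


def per_topic_accuracy(y_true: List[str], y_pred: List[str]) -> Dict[str, float]:
    """Fraction of examples of each gold topic that were predicted correctly."""
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {topic: correct[topic] / totals[topic] for topic in totals}
```

For example, with gold labels `["Sport", "Sport", "Tech"]` and predictions `["Sport", "Tech", "Tech"]`, the distribution is 2 Sport examples and 1 Tech example, and the per-topic accuracy is 0.5 for Sport and 1.0 for Tech.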

📄 License

This project is licensed under the Apache-2.0 license.

📚 Citation

@misc{modernbert,
      title={Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference}, 
      author={Benjamin Warner and Antoine Chaffin and Benjamin Clavié and Orion Weller and Oskar Hallström and Said Taghadouini and Alexis Gallagher and Raja Biswas and Faisal Ladhak and Tom Aarsen and Nathan Cooper and Griffin Adams and Jeremy Howard and Iacopo Poli},
      year={2024},
      eprint={2412.13663},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.13663}, 
}

⚠️ Important Note

This is an experimental Arabic model that demonstrates how ModernBERT can be adapted to Arabic for tasks like topic classification.

Property	Details
Model Type	AraModernBert for Topic Classification
Training Data	SANAD DATASET
Base Model	ModernBERT-base
Pipeline Tag	text-classification
Library Name	transformers
Tags	modernbert, arabic
