🚀 AraModernBert-base-V1.0
AraModernBert is an advanced Arabic language model built on the ModernBERT architecture. It combines a state-of-the-art transformer design with training on 100 GB of Arabic text, significantly advancing Arabic language understanding.
🚀 Quick Start
Here's how to use AraModernBert with the Transformers library:
Basic Usage
```python
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and encoder from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("NAMAA-Space/AraModernBert-Base-V1.0")
model = AutoModel.from_pretrained("NAMAA-Space/AraModernBert-Base-V1.0")

# "Welcome to the world of artificial intelligence"
text = "مرحبا بكم في عالم الذكاء الاصطناعي"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

# Token-level contextual embeddings: (batch_size, sequence_length, hidden_size)
embeddings = outputs.last_hidden_state
```
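The last hidden state holds one vector per token. If you need a single sentence-level embedding (e.g. for retrieval or similarity), one common approach is attention-mask-aware mean pooling; continuing the snippet above, here is a minimal sketch (this pooling choice is an assumption, not an official recipe for this checkpoint):

```python
# Mean-pool token embeddings into one sentence vector, ignoring padding positions.
mask = inputs["attention_mask"].unsqueeze(-1).float()              # (1, seq_len, 1)
sentence_embedding = (embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)                                    # torch.Size([1, 768])
```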
Advanced Usage
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

tokenizer = AutoTokenizer.from_pretrained("NAMAA-Space/AraModernBert-Base-V1.0")
model = AutoModelForMaskedLM.from_pretrained("NAMAA-Space/AraModernBert-Base-V1.0")

# "Artificial intelligence is the [MASK] of the future."
text = "الذكاء الاصطناعي هو [MASK] المستقبل."
inputs = tokenizer(text, return_tensors="pt")

# Locate the position of the [MASK] token in the input
token_index = torch.where(inputs["input_ids"][0] == tokenizer.mask_token_id)[0]

outputs = model(**inputs)
predictions = outputs.logits

# Pick the highest-scoring token for the masked position and decode it
predicted_token_id = torch.argmax(predictions[0, token_index]).item()
predicted_token = tokenizer.decode(predicted_token_id)
print(predicted_token)
```
✨ Features
- Advanced Architecture: Built on the ModernBERT architecture with a state-of-the-art transformer design.
- Large-Scale Training: Trained on 100 GB of Arabic text.
- Custom Tokenizer: A specialized tokenizer with a 50,280-token vocabulary, optimized for Arabic language processing.
- Trans-Tokenization Technique: Used to optimally initialize the embedding layer for masked language modeling (MLM).
- Long Context Support: Handles sequences of up to 8,192 tokens for processing longer documents (see the sketch below).
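A quick illustration of the long-context support, reusing the tokenizer from Quick Start (long_arabic_document is a placeholder for your own text):

```python
# Cap input length at the model's 8,192-token context window.
inputs = tokenizer(
    long_arabic_document,   # placeholder: any long Arabic string
    return_tensors="pt",
    truncation=True,
    max_length=8192,
)
print(inputs["input_ids"].shape)   # at most (1, 8192)
```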
📚 Documentation
Model Configuration
```json
{
  "hidden_size": 768,
  "intermediate_size": 1152,
  "num_attention_heads": 12,
  "num_hidden_layers": 22,
  "max_position_embeddings": 8192,
  "vocab_size": 50280,
  "global_attn_every_n_layers": 3,
  "local_attention": 128,
  "global_rope_theta": 160000.0,
  "local_rope_theta": 10000.0,
  "architectures": ["ModernBertForMaskedLM"],
  "model_type": "modernbert",
  "cls_token_id": 3,
  "mask_token_id": 6,
  "pad_token_id": 5,
  "sep_token_id": 4,
  "unk_token_id": 2
}
```
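These values can also be inspected programmatically via AutoConfig; a quick sketch:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("NAMAA-Space/AraModernBert-Base-V1.0")
print(config.max_position_embeddings)     # 8192
print(config.num_hidden_layers)           # 22
print(config.global_attn_every_n_layers)  # 3
```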
Intended Uses & Limitations
AraModernBert can be used for a wide range of Arabic NLP tasks, including:
- Text Embeddings & Representation
- Information Retrieval
- RAG (Retrieval Augmented Generation)
- Document Similarity (see the sketch after this list)
- Text Classification
- Sentiment Analysis
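For document similarity, for example, a common approach is to compare mean-pooled embeddings with cosine similarity; the sketch below makes that assumption explicit (the pooling strategy is illustrative, not an official recommendation):

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("NAMAA-Space/AraModernBert-Base-V1.0")
model = AutoModel.from_pretrained("NAMAA-Space/AraModernBert-Base-V1.0")

def embed(text: str) -> torch.Tensor:
    # Encode the text and mean-pool token embeddings over non-padding positions.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=8192)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

doc_a = "الذكاء الاصطناعي يغير العالم"            # "AI is changing the world"
doc_b = "تقنيات الذكاء الاصطناعي تتطور بسرعة"     # "AI technologies are evolving rapidly"
print(F.cosine_similarity(embed(doc_a), embed(doc_b)).item())
```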
Limitations and Biases
- The model is optimized for Modern Standard Arabic and may show varying performance on dialectal Arabic variants or classical Arabic texts.
- Performance may vary across domains and specialized terminology.
- Users should be aware of potential biases present in the training data.
Evaluation Results

1. Semantic Textual Similarity (STS)
We fine-tuned the model on STS datasets to enhance its semantic understanding capabilities:
- STS17: 0.831
- STS22: 0.617
Note: The STS-optimized model will be released soon as a separate checkpoint.
2. Text Classification
We fine-tuned AraModernBert on a seven-class classification task using the SANAD dataset (see the head-setup sketch after the per-class table below).
Overall Metrics (AraModernBert):
- Accuracy: 94.32%
- F1 Score: 94.31%
- Precision: 94.31%
- Recall: 94.32%
Per-Class Performance (AraModernBert):

| Class | Precision | Recall | F1-Score | Support |
|-------|-----------|--------|----------|---------|
| 0 | 92.13% | 92.43% | 92.28% | 1,849 |
| 1 | 93.63% | 93.70% | 93.67% | 3,937 |
| 2 | 90.70% | 90.70% | 90.70% | 2,075 |
| 3 | 96.30% | 93.81% | 95.04% | 776 |
| 4 | 96.09% | 95.84% | 95.96% | 1,898 |
| 5 | 89.24% | 87.99% | 88.61% | 641 |
| 6 | 98.55% | 99.37% | 98.96% | 3,005 |
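For reference, this setup corresponds to loading the checkpoint with a sequence-classification head. The sketch below shows only the head setup (with num_labels=7 for the seven classes above), not the full fine-tuning recipe:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("NAMAA-Space/AraModernBert-Base-V1.0")
model = AutoModelForSequenceClassification.from_pretrained(
    "NAMAA-Space/AraModernBert-Base-V1.0",
    num_labels=7,  # the seven SANAD classes reported above
)
# The classification head is newly initialized here; fine-tune it on labeled data
# (e.g. with the Trainer API) before using it for predictions.
```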
3. Named Entity Recognition (NER)
The model achieved the following results on Arabic NER tasks:
- Accuracy: 90.39%
- Precision: 73.57%
- Recall: 74.42%
- F1: 73.99%
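NER fine-tuning follows the same pattern with a token-classification head; a minimal sketch (the label count is illustrative, not taken from the evaluation setup):

```python
from transformers import AutoModelForTokenClassification

ner_model = AutoModelForTokenClassification.from_pretrained(
    "NAMAA-Space/AraModernBert-Base-V1.0",
    num_labels=9,  # illustrative, e.g. a BIO tag set; adjust to your dataset
)
```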
Model Architecture
AraModernBert inherits the modern architectural features of ModernBERT and adds a Trans-Tokenization approach for embedding initialization:
- 22 transformer layers with a hidden size of 768
- Alternating attention, with global attention every 3 layers and a 128-token local attention window (see the sketch after this list)
- Rotary Positional Embeddings (RoPE) with different theta values for global (160000.0) and local (10000.0) attention
- 8,192-token context window for processing longer documents
- Specialized 50,280-token vocabulary optimized for Arabic
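As a rough illustration of the alternating pattern, the sketch below assumes ModernBERT's usual convention that layers whose index is divisible by global_attn_every_n_layers (here 3) use global attention:

```python
# Sketch of the attention pattern implied by the configuration above,
# assuming layers with index % 3 == 0 use global attention (ModernBERT convention).
for layer_idx in range(22):
    kind = "global" if layer_idx % 3 == 0 else "local (128-token window)"
    print(f"layer {layer_idx:2d}: {kind} attention")
```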
Technical Specifications
| Property | Details |
|----------|---------|
| Base Architecture | ModernBERT |
| Parameters | ~149M (based on configuration) |
| Context Length | 8,192 tokens |
| Vocabulary Size | 50,280 |
| Hidden Size | 768 |
| Attention Heads | 12 |
| Hidden Layers | 22 |
| Intermediate Size | 1,152 |
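The ~149M figure can be checked by loading the encoder and counting parameters:

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("NAMAA-Space/AraModernBert-Base-V1.0")
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e6:.1f}M parameters")
```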
Citation
If you use this model in your research, please cite:
```bibtex
@misc{AraModernBERT2025,
  title={AraModernBERT: Advanced Arabic Language Model Through Trans-Tokenization and ModernBERT architecture},
  author={NAMAA},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/NAMAA-Space/AraModernBert-Base-V1.0}},
  note={Accessed: 2025-03-02}
}
```
Acknowledgements
This model builds upon the ModernBERT architecture developed by Answer.AI and LightOn. We acknowledge their contributions to the field of encoder-only models and extend their work to the Arabic language through our novel Trans-Tokenization approach.
```bibtex
@misc{modernbert,
  title={Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference},
  author={Benjamin Warner and Antoine Chaffin and Benjamin Clavié and Orion Weller and Oskar Hallström and Said Taghadouini and Alexis Gallagher and Raja Biswas and Faisal Ladhak and Tom Aarsen and Nathan Cooper and Griffin Adams and Jeremy Howard and Iacopo Poli},
  year={2024},
  eprint={2412.13663},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2412.13663},
}

@inproceedings{remy-delobelle2024transtokenization,
  title={Trans-Tokenization and Cross-lingual Vocabulary Transfers: Language Adaptation of {LLM}s for Low-Resource {NLP}},
  author={Remy, Fran{\c{c}}ois and Delobelle, Pieter and Avetisyan, Hayastan and Khabibullina, Alfiya and de Lhoneux, Miryam and Demeester, Thomas},
  booktitle={First Conference on Language Modeling},
  year={2024},
  url={https://openreview.net/forum?id=sBxvoDhvao}
}
```
📄 License
This model is licensed under the Apache 2.0 license.