🚀 Ara-EuroBERT: Arabic-optimized Sentence Transformer
Ara-EuroBERT is a sentence-transformers model fine-tuned from EuroBERT/EuroBERT-210m and optimized for semantic Arabic text embeddings. It maps sentences and paragraphs to a 768-dimensional dense vector space and supports a maximum sequence length of 8,192 tokens.
🚀 Quick Start
Ara-EuroBERT supports a wide range of Arabic NLP tasks, from semantic similarity to retrieval and clustering. Follow the installation and usage steps below to get started.
✨ Features
- Optimized for Arabic: Specifically tailored for semantic Arabic text embeddings.
- High-Dimensional Vector Space: Maps sentences and paragraphs to a 768-dimensional dense vector space.
- Long Sequence Handling: Supports a maximum sequence length of 8,192 tokens.
- Matryoshka Embeddings: Allows for flexible embedding dimensionality without retraining.
- Strong Performance: Shows remarkable improvements over the base model on STS17 and STS22.v2.
📦 Installation
```bash
pip install -U sentence-transformers
```
💻 Usage Examples
Basic Usage
```python
from sentence_transformers import SentenceTransformer, util

# Load the model from the Hugging Face Hub
model = SentenceTransformer("Omartificial-Intelligence-Space/AraEuroBert-210M")

sentences = [
    'التقدم العلمي في مجال الذكاء الاصطناعي يتسارع بشكل ملحوظ في السنوات الأخيرة',  # AI progress is accelerating markedly
    'تطوير نماذج لغوية متقدمة يساهم في تحسين فهم اللغة العربية آليًا',  # advanced LMs improve machine understanding of Arabic
    'أصبحت تقنيات معالجة اللغات الطبيعية جزءًا أساسيًا من التطبيقات الحديثة',  # NLP is now essential to modern applications
    'يعاني الشرق الأوسط من تحديات مناخية متزايدة تهدد الأمن المائي والغذائي',  # climate challenges threaten water and food security
    'تراث الأدب العربي غني بالقصائد والروايات التي تعكس تاريخ وثقافة المنطقة',  # Arabic literary heritage reflects regional history
]

# Compute 768-dimensional embeddings
embeddings = model.encode(sentences)
print(embeddings.shape)  # (5, 768)

# Pairwise cosine similarities
similarities = util.cos_sim(embeddings, embeddings)
print(similarities)
```
Advanced Usage: Matryoshka Embeddings
```python
# Matryoshka embeddings: truncate to smaller dimensions at encode time
embeddings_768 = model.encode(sentences)                    # full 768 dimensions
embeddings_256 = model.encode(sentences, truncate_dim=256)  # 256 dimensions
embeddings_64 = model.encode(sentences, truncate_dim=64)    # 64 dimensions
```
📚 Documentation
Model Details
Model Description
| Property | Details |
|---|---|
| Model Type | Sentence Transformer with Matryoshka Embeddings |
| Base Model | EuroBERT/EuroBERT-210m |
| Maximum Sequence Length | 8,192 tokens |
| Output Dimensionality | Matryoshka embeddings with dimensions [768, 512, 256, 128, 64] |
| Similarity Function | Cosine similarity |
| Languages | Optimized for Arabic |
| License | MIT (same as EuroBERT) |
Matryoshka Embeddings
This model is trained with Matryoshka Representation Learning, enabling flexible embedding dimensionality without retraining. You can choose smaller dimensions (64, 128, 256, 512) for efficiency or the full 768 dimensions for maximum performance. The model maintains strong performance even at reduced dimensions:
| Dimension | Spearman Correlation (STS Dev) |
|---|---|
| 768 | 0.8101 |
| 512 | 0.8088 |
| 256 | 0.8081 |
| 128 | 0.8055 |
| 64 | 0.7976 |
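As an alternative to passing `truncate_dim` at encode time (as in the advanced usage example above), recent sentence-transformers releases (v2.7+) also accept a `truncate_dim` argument in the constructor, fixing the output dimension for every call; a minimal sketch:

```python
from sentence_transformers import SentenceTransformer

# Fix the output dimension once at load time (sentence-transformers >= 2.7)
model_256 = SentenceTransformer(
    "Omartificial-Intelligence-Space/AraEuroBert-210M",
    truncate_dim=256,
)

embeddings = model_256.encode(["نص تجريبي للتضمين"])  # "a test sentence for embedding"
print(embeddings.shape)  # (1, 256)
```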
Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: EuroBertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```
Use Cases
This model is well-suited for a variety of Arabic NLP tasks:
- Semantic textual similarity
- Semantic search and information retrieval (see the sketch after this list)
- Document clustering and classification
- Question answering
- Paraphrase detection
- Zero-shot classification
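For semantic search, a minimal sketch using `util.semantic_search` from sentence-transformers; the corpus and query below are illustrative placeholders:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("Omartificial-Intelligence-Space/AraEuroBert-210M")

# Illustrative corpus and query; any Arabic documents work here
corpus = [
    "تعلم الآلة فرع من الذكاء الاصطناعي",    # "Machine learning is a branch of AI"
    "القاهرة هي عاصمة جمهورية مصر العربية",  # "Cairo is the capital of Egypt"
]
query = "ما هو الذكاء الاصطناعي؟"            # "What is artificial intelligence?"

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Retrieve the best-matching document by cosine similarity
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=1)
print(corpus[hits[0][0]["corpus_id"]], hits[0][0]["score"])
```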
Training Method
- Loss Function: MatryoshkaLoss wrapping MultipleNegativesRankingLoss (a training sketch follows this list)
- Matryoshka Dimensions: [768, 512, 256, 128, 64]
- Batch Size: 32
- Epochs: 1 (with early stopping)
- Optimizer: AdamW
- Learning Rate: 5e-05 with a linear scheduler and 10% warmup
- Hardware: Multiple NVIDIA GPUs with mixed precision (fp16)
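A minimal sketch of this recipe using the sentence-transformers v3 Trainer API; the training dataset below is a hypothetical stand-in (the actual Arabic training data is not shown), and `trust_remote_code=True` is assumed for EuroBERT's custom modeling code:

```python
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

model = SentenceTransformer("EuroBERT/EuroBERT-210m", trust_remote_code=True)

# Hypothetical (anchor, positive) pair dataset; swap in the real training data
train_dataset = load_dataset("sentence-transformers/all-nli", "pair", split="train")

# MultipleNegativesRankingLoss uses other in-batch positives as negatives;
# MatryoshkaLoss applies it at every truncated dimension
base_loss = MultipleNegativesRankingLoss(model)
loss = MatryoshkaLoss(model, base_loss, matryoshka_dims=[768, 512, 256, 128, 64])

args = SentenceTransformerTrainingArguments(
    output_dir="ara-eurobert-210m",
    num_train_epochs=1,
    per_device_train_batch_size=32,
    learning_rate=5e-5,
    warmup_ratio=0.1,
    lr_scheduler_type="linear",
    fp16=True,
)

trainer = SentenceTransformerTrainer(
    model=model, args=args, train_dataset=train_dataset, loss=loss
)
trainer.train()
```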
Base Model: EuroBERT
EuroBERT is a new family of multilingual encoder models designed specifically for European and widely spoken global languages. It offers several advantages over traditional multilingual encoders:
- Extensive Multilingual Coverage: Trained on a 5-trillion-token dataset spanning 15 languages.
- Advanced Architecture: Uses grouped query attention, rotary position embeddings, and RMS normalization.
- Long Context Support: Natively processes up to 8,192 tokens.
- Specialized Knowledge: Includes math and programming language data for improved reasoning.
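For reference, the base encoder can also be loaded directly with transformers; a sketch, assuming (per the EuroBERT model card) that `trust_remote_code=True` is required for its custom modeling code:

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EuroBERT/EuroBERT-210m")
model = AutoModel.from_pretrained("EuroBERT/EuroBERT-210m", trust_remote_code=True)

inputs = tokenizer("نص باللغة العربية", return_tensors="pt")  # "text in Arabic"
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, seq_len, 768)
```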
Limitations and Recommendations
⚠️ Important Note
The model is primarily optimized for Arabic text and may not perform well on other languages. Performance may also vary in specialized domains that are under-represented in the training data.
💡 Usage Tip
For short texts (<5 words), consider augmenting with context for better representations. For extremely long documents, consider splitting into meaningful chunks before encoding.
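As an illustration of chunking, a naive sketch that splits on sentence-final punctuation (deliberately simplistic; a proper Arabic sentence segmenter would handle edge cases better):

```python
import re

from sentence_transformers import SentenceTransformer

def chunk_text(text: str, max_words: int = 200) -> list[str]:
    """Naively split a long document into chunks of at most max_words words."""
    # Split on Arabic/Latin sentence-final punctuation
    sentences = re.split(r"(?<=[.!?؟])\s+", text)
    chunks, current, count = [], [], 0
    for sent in sentences:
        words = sent.split()
        if count + len(words) > max_words and current:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.extend(words)
        count += len(words)
    if current:
        chunks.append(" ".join(current))
    return chunks

model = SentenceTransformer("Omartificial-Intelligence-Space/AraEuroBert-210M")
long_document = "..."  # placeholder for a long Arabic document
chunk_embeddings = model.encode(chunk_text(long_document))  # one vector per chunk
```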
Citation
If you use this model in your research, please cite the following works:
```bibtex
@misc{boizard2025eurobertscalingmultilingualencoders,
  title={EuroBERT: Scaling Multilingual Encoders for European Languages},
  author={Nicolas Boizard and Hippolyte Gisserot-Boukhlef and Duarte M. Alves and André Martins and Ayoub Hammal and Caio Corro and Céline Hudelot and Emmanuel Malherbe and Etienne Malaboeuf and Fanny Jourdan and Gabriel Hautreux and João Alves and Kevin El-Haddad and Manuel Faysse and Maxime Peyrard and Nuno M. Guerreiro and Patrick Fernandes and Ricardo Rei and Pierre Colombo},
  year={2025},
  eprint={2503.05500},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2503.05500},
}

@inproceedings{reimers-2019-sentence-bert,
  title={Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks},
  author={Reimers, Nils and Gurevych, Iryna},
  booktitle={Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing},
  month={11},
  year={2019},
  publisher={Association for Computational Linguistics},
  url={https://arxiv.org/abs/1908.10084},
}

@misc{kusupati2024matryoshka,
  title={Matryoshka Representation Learning},
  author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
  year={2024},
  eprint={2205.13147},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
}
```
📄 License
This model is licensed under the MIT license, same as EuroBERT.