🚀 GATE-AraBert-V1
This is GATE | General Arabic Text Embedding, trained with SentenceTransformers in a multi-task setup on the AllNLI and STS datasets. It is described in detail in the paper GATE: General Arabic Text Embedding for Enhanced Semantic Textual Similarity with Hybrid Loss Training. The model handles Arabic text embedding tasks effectively, providing high-quality semantic representations for Arabic text.
Project page: https://huggingface.co/collections/Omartificial-Intelligence-Space/arabic-matryoshka-embedding-models-666f764d3b570f44d7f77d4e
🚀 Quick Start
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load the model and run inference:

from sentence_transformers import SentenceTransformer

# Load the model from the Hugging Face Hub.
model = SentenceTransformer("Omartificial-Intelligence-Space/GATE-AraBert-v1")

sentences = [
    'الكلب البني مستلقي على جانبه على سجادة بيج، مع جسم أخضر في المقدمة.',  # "The brown dog is lying on its side on a beige rug, with a green object in the foreground."
    'لقد مات الكلب',  # "The dog has died."
    'شخص طويل القامة',  # "A tall person."
]

# Encode the sentences into dense embeddings.
embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, embedding_dim)

# Pairwise similarity matrix (cosine similarity) between all sentences.
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)  # (3, 3)
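The similarity matrix can also be used directly for ranking. As an illustrative continuation of the snippet above (the query sentence is a made-up example, not from the original card):

# Rank the three sentences against a new query.
query_embedding = model.encode('كلب يستلقي على الأرض')  # "A dog lying on the ground" (hypothetical query)
scores = model.similarity(query_embedding, embeddings)  # shape (1, 3): query vs. each sentence
best = int(scores.argmax())
print(sentences[best], float(scores[0, best]))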
✨ Features
- Multi-task Training: Trained on both the AllNLI and STS datasets, enabling it to handle a variety of semantic tasks.
- High-quality Embeddings: Generates high-dimensional, semantically rich Arabic text embeddings.
- Cosine Similarity: Uses cosine similarity as the similarity function to measure semantic similarity between texts (see the sketch after this list).
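Because cosine similarity is the model's similarity function, model.similarity is equivalent to computing cosine similarity over the raw embeddings. A minimal verification sketch (the two sentences are made-up examples):

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Omartificial-Intelligence-Space/GATE-AraBert-v1")

# "The weather is nice today" / "The weather is great today"
emb = model.encode(['الجو جميل اليوم', 'الطقس رائع اليوم'])

# Cosine similarity computed by hand...
manual = float(np.dot(emb[0], emb[1]) / (np.linalg.norm(emb[0]) * np.linalg.norm(emb[1])))
# ...should match the model's built-in similarity function.
built_in = float(model.similarity(emb[0], emb[1]))
print(manual, built_in)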
📚 Documentation
Model Details
Model Description
- Model Type: Sentence Transformer
- Base Model: Omartificial-Intelligence-Space/Arabic-Triplet-Matryoshka-V2
- Language: Arabic
- Similarity Function: Cosine Similarity
- Training Datasets: AllNLI, STS
Evaluation
Metrics
Semantic Similarity (sts-dev)

| Metric | Value |
|---|---|
| pearson_cosine | 0.8391 |
| spearman_cosine | 0.841 |
| pearson_manhattan | 0.8277 |
| spearman_manhattan | 0.8361 |
| pearson_euclidean | 0.8274 |
| spearman_euclidean | 0.8358 |
| pearson_dot | 0.8154 |
| spearman_dot | 0.818 |
| pearson_max | 0.8391 |
| spearman_max | 0.841 |
Semantic Similarity (sts-test)

| Metric | Value |
|---|---|
| pearson_cosine | 0.813 |
| spearman_cosine | 0.8173 |
| pearson_manhattan | 0.8114 |
| spearman_manhattan | 0.8164 |
| pearson_euclidean | 0.8103 |
| spearman_euclidean | 0.8158 |
| pearson_dot | 0.7908 |
| spearman_dot | 0.7887 |
| pearson_max | 0.813 |
| spearman_max | 0.8173 |
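These correlations are the standard output of the Sentence Transformers EmbeddingSimilarityEvaluator. A sketch of how such numbers are produced, assuming a toy two-pair dev set in place of the actual STS data:

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("Omartificial-Intelligence-Space/GATE-AraBert-v1")

# Toy stand-in for an STS dev split: sentence pairs with gold scores rescaled to [0, 1].
sentences1 = ['الجو جميل اليوم', 'القطة تنام على الأريكة']  # "The weather is nice today" / "The cat sleeps on the couch"
sentences2 = ['الطقس رائع اليوم', 'الرجل يقود سيارته']  # "The weather is great today" / "The man drives his car"
gold_scores = [0.95, 0.05]

evaluator = EmbeddingSimilarityEvaluator(sentences1, sentences2, gold_scores, name="sts-dev")
# Reports Pearson/Spearman correlations for cosine, Euclidean, Manhattan, and dot-product similarities.
print(evaluator(model))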
🔧 Technical Details
The model is built on the Sentence Transformers framework and uses a multi-task training approach: training on the AllNLI and STS datasets lets it learn complementary semantic signals from natural language inference and semantic textual similarity data. The base model is [Omartificial-Intelligence-Space/Arabic-Triplet-Matryoshka-V2](https://huggingface.co/Omartificial-Intelligence-Space/Arabic-Triplet-Matryoshka-V2), which provides a solid foundation for the model's performance.
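The exact hybrid-loss recipe is specified in the paper rather than reproduced here. Purely as an illustration of what such a multi-task setup can look like, Sentence Transformers (v3+) lets each training dataset drive its own loss; the rows below are toy placeholders, not the GATE training data:

from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss, CosineSimilarityLoss

# Start from the base model named in this card.
model = SentenceTransformer("Omartificial-Intelligence-Space/Arabic-Triplet-Matryoshka-V2")

# Toy stand-in for AllNLI-style triplets: (anchor, entailed positive, contradicting negative).
nli = Dataset.from_dict({
    "anchor":   ['القطة تنام على الأريكة', 'الرجل يقرأ كتاباً'],  # "The cat sleeps on the couch" / "The man reads a book"
    "positive": ['قطة نائمة فوق الكنبة', 'شخص يطالع كتاباً'],  # "A cat sleeping on the sofa" / "A person reading a book"
    "negative": ['الرجل يقود سيارته', 'الطفل يلعب بالكرة'],  # "The man drives his car" / "The child plays with a ball"
})

# Toy stand-in for STS-style pairs with gold scores rescaled to [0, 1].
sts = Dataset.from_dict({
    "sentence1": ['الجو جميل اليوم', 'القطة تنام'],  # "The weather is nice today" / "The cat is sleeping"
    "sentence2": ['الطقس رائع اليوم', 'الكلب يجري'],  # "The weather is great today" / "The dog is running"
    "score":     [0.95, 0.05],
})

# One loss per dataset: a ranking loss for NLI triplets, a regression loss for scored STS pairs.
losses = {
    "nli": MultipleNegativesRankingLoss(model),
    "sts": CosineSimilarityLoss(model),
}

trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset={"nli": nli, "sts": sts},
    loss=losses,
)
trainer.train()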
📄 License
This project is licensed under the Apache-2.0 license.
👏 Acknowledgments
The author would like to thank Prince Sultan University for their invaluable support in this project. Their contributions and resources have been instrumental in the development and fine-tuning of these models.
📖 Citation
If you use GATE, please cite it as follows:

@misc{nacar2025GATE,
  title={GATE: General Arabic Text Embedding for Enhanced Semantic Textual Similarity with Hybrid Loss Training},
  author={Omer Nacar and Anis Koubaa and Serry Taiseer Sibaee and Lahouari Ghouti},
  year={2025},
  note={Submitted to COLING 2025},
  url={https://huggingface.co/Omartificial-Intelligence-Space/GATE-AraBert-v1},
}