🚀 Arabic-Retrieval-v1.0
This is a high-performance Arabic information retrieval model built using the robust sentence-transformers framework. It delivers state-of-the-art performance and is tailored to the richness and complexity of the Arabic language.
🚀 Quick Start
First, install the Sentence Transformers library:
```bash
pip install -U sentence-transformers
```
Then, you can load this model and run inference. It is important to add the prefixes `<query>: ` and `<passage>: ` to your queries and passages while retrieving, in the following way:
```python
from sentence_transformers import SentenceTransformer

# Load the model from the Hugging Face Hub
model = SentenceTransformer("omarelshehy/Arabic-Retrieval-v1.0")

# Prefix the query with "<query>: " and each passage with "<passage>: "
query = "<query>: كيف يمكن للذكاء الاصطناعي تحسين طرق التدريس التقليدية؟"
passages = [
    "<passage>: طرق التدريس التقليدية تستفيد من الذكاء الاصطناعي عبر تحسين عملية المتابعة وتخصيص التجربة التعليمية. يقوم الذكاء الاصطناعي بتحليل بيانات الطلاب وتقديم توصيات فعالة للمعلمين حول طرق التدريس الأفضل.",
    "<passage>: تطوير التعليم الشخصي يعتمد بشكل كبير على الذكاء الاصطناعي، الذي يقوم بمتابعة تقدم الطلاب بشكل فردي. يقدم الذكاء الاصطناعي حلولاً تعليمية مخصصة لكل طالب بناءً على مستواه وأدائه.",
    "<passage>: الدقة في تقييم الطلاب تتزايد بفضل الذكاء الاصطناعي الذي يقارن النتائج مع معايير متقدمة. بالرغم من التحديات التقليدية، الذكاء الاصطناعي يوفر أدوات تحليل تتيح تقييماً أدق لأداء الطلاب."
]

# Encode the query and the passages
embeddings_query = model.encode(query)
embeddings_passages = model.encode(passages)

# Compute similarities and pick the best-matching passage
similarities = model.similarity(embeddings_query, embeddings_passages)
best_match = passages[similarities.argmax().item()]
print(f"Best matching passage is: {best_match}")
```
✨ Features
- 🔥 Outstanding Performance: Matches the accuracy of top-tier multilingual models like multilingual-e5-large. See the [evaluation](https://huggingface.co/omarelshehy/Arabic-Retrieval-v1.0#evaluation) below.
- 💡 Arabic-Focused: Designed specifically for the nuances and dialects of Arabic, ensuring more accurate and context-aware results.
- 📉 Lightweight Efficiency: Requires 25%-50% less memory, making it ideal for environments with limited resources or edge deployments.
🌍 Why This Model?
Multilingual models are powerful, but they’re often bulky and not optimized for specific languages. This model bridges that gap, offering Arabic-native capabilities without sacrificing performance or efficiency. Whether you’re working on search engines, chatbots, or large-scale NLP pipelines, this model provides a fast, accurate, and resource-efficient solution.
📚 Documentation
Model Details
Model Description
| Property | Details |
|----------|---------|
| Model Type | Sentence Transformer |
| Maximum Sequence Length | 512 tokens |
| Output Dimensionality | 768 dimensions |
| Similarity Function | Cosine Similarity |
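These properties can also be read back from the loaded model. A minimal sketch (the `similarity_fn_name` attribute assumes a recent sentence-transformers release, v3.0 or later):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("omarelshehy/Arabic-Retrieval-v1.0")

print(model.max_seq_length)                      # expected: 512
print(model.get_sentence_embedding_dimension())  # expected: 768
print(model.similarity_fn_name)                  # expected: "cosine"
```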
Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```
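The architecture above is a BERT encoder followed by mean pooling over token embeddings. For reference, the pooling step can be reproduced with plain transformers, assuming the repo exposes the underlying BERT weights and tokenizer in a transformers-compatible layout (typical for sentence-transformers models); this is a sketch, not the recommended inference path:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Assumption: the underlying BERT weights/tokenizer are loadable directly from the repo
tokenizer = AutoTokenizer.from_pretrained("omarelshehy/Arabic-Retrieval-v1.0")
bert = AutoModel.from_pretrained("omarelshehy/Arabic-Retrieval-v1.0")

def mean_pool(last_hidden_state, attention_mask):
    # Average token embeddings, ignoring padding positions
    mask = attention_mask.unsqueeze(-1).float()
    return (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

inputs = tokenizer("<passage>: مثال", return_tensors="pt",
                   truncation=True, max_length=512)
with torch.no_grad():
    output = bert(**inputs)
embedding = mean_pool(output.last_hidden_state, inputs["attention_mask"])
print(embedding.shape)  # torch.Size([1, 768])
```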
Evaluation
This model has been evaluated on 3 different datasets using the NDCG@10 metric:
- Dataset 1: [castorini/mr-tydi](https://huggingface.co/datasets/castorini/mr-tydi)
- Dataset 2: [Omartificial-Intelligence-Space/Arabic-finanical-rag-embedding-dataset](https://huggingface.co/datasets/Omartificial-Intelligence-Space/Arabic-finanical-rag-embedding-dataset)
- Dataset 3: [sadeem-ai/sadeem-ar-eval-retrieval-questions](https://huggingface.co/datasets/sadeem-ai/sadeem-ar-eval-retrieval-questions)

It is compared to other highly performant models (a sketch of running such an NDCG@10 evaluation follows the table):
| Model | Dataset 1 | Dataset 2 | Dataset 3 |
|-------|-----------|-----------|-----------|
| Arabic-Retrieval-v1.0 | 0.875 | 0.72 | 0.679 |
| intfloat/multilingual-e5-large | 0.89 | 0.719 | 0.698 |
| intfloat/multilingual-e5-base | 0.87 | 0.69 | 0.686 |
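An NDCG@10 evaluation of this kind can be run with the `InformationRetrievalEvaluator` from sentence-transformers. The snippet below is a minimal sketch with toy IDs and texts; the exact evaluation setup used for the numbers above is not specified in this card, so a real run would load one of the datasets listed above instead.

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("omarelshehy/Arabic-Retrieval-v1.0")

# Toy example with made-up IDs; real runs would build these dicts from one of the datasets above
queries = {"q1": "<query>: كيف يحسن الذكاء الاصطناعي التعليم؟"}
corpus = {
    "d1": "<passage>: الذكاء الاصطناعي يخصص التجربة التعليمية لكل طالب.",
    "d2": "<passage>: نص غير ذي صلة بالسؤال.",
}
relevant_docs = {"q1": {"d1"}}

evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    ndcg_at_k=[10],
    name="arabic-ir-demo",
)
results = evaluator(model)
print(results)
```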
📄 License
This model is licensed under the apache-2.0 license.
🔧 Technical Details
Citation
BibTeX
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```
MultipleNegativesRankingLoss
```bibtex
@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```
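The citation above refers to MultipleNegativesRankingLoss, a contrastive objective with in-batch negatives commonly used to train sentence-transformers retrieval models. The following is a generic, minimal sketch of fine-tuning with this loss; the training pairs are hypothetical and this is not the actual recipe used to produce this model.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("omarelshehy/Arabic-Retrieval-v1.0")

# Hypothetical (query, positive passage) pairs; other passages in the batch act as negatives
train_examples = [
    InputExample(texts=["<query>: سؤال تدريبي", "<passage>: فقرة ذات صلة"]),
    InputExample(texts=["<query>: سؤال آخر", "<passage>: فقرة أخرى ذات صلة"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=10,
)
```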