🚀 SILMA Arabic Matryoshka Embedding Model 0.1
The SILMA Arabic Matryoshka Embedding Model 0.1 is an advanced Arabic text embedding model. It generates powerful, context-rich text representations, enabling a wide range of applications from semantic search to document classification. The model uses the Matryoshka Embedding technique, which lets a single embedding be truncated to different dimensions to balance speed, storage, and accuracy.
🚀 Quick Start
📦 Installation
First, install the Sentence Transformers library:
```bash
pip install -U sentence-transformers
```
Then load the model:
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
import pandas as pd

model_name = "silma-ai/silma-embeddding-matryoshka-0.1"
model = SentenceTransformer(model_name)
```
💻 Usage Examples
🔍 Basic Usage
With Matryoshka embeddings, you can keep only the first n dimensions of each text's representation. The samples below show how the chosen dimension affects the cosine similarity between a query and two candidate sentences. In most cases, even a very low dimension (e.g., 8) still produces acceptable semantic similarity scores.
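Before involving the model, the truncation itself can be sketched with plain NumPy. The vectors below are random placeholders standing in for real 768-dimensional embeddings; the point is only that cosine similarity on a prefix of the vector is well defined, since the similarity is normalized by the truncated norms:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
full = rng.normal(size=(2, 768))  # stand-ins for two 768-dim embeddings

for dim in [768, 256, 48, 16, 8]:
    a, b = full[0][:dim], full[1][:dim]  # keep only the first `dim` dimensions
    score = cosine(a, b)
    print(dim, round(score, 4))
```

With real Matryoshka-trained embeddings (unlike these random vectors), the low-dimensional prefixes are trained to preserve most of the semantic signal, which is why truncation works in the examples below.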
[+] Short Sentence Similarity
```python
query = "الطقس اليوم مشمس"  # "The weather today is sunny"
sentence_1 = "الجو اليوم كان مشمسًا ورائعًا"  # "The weather today was sunny and wonderful"
sentence_2 = "الطقس اليوم غائم"  # "The weather today is cloudy"

scores = []
for dim in [768, 256, 48, 16, 8]:
    query_embedding = model.encode(query)[:dim]
    sent1_score = cos_sim(query_embedding, model.encode(sentence_1)[:dim])[0][0].tolist()
    sent2_score = cos_sim(query_embedding, model.encode(sentence_2)[:dim])[0][0].tolist()
    scores.append({
        "dim": dim,
        "valid_top": sent1_score > sent2_score,
        "sent1_score": sent1_score,
        "sent2_score": sent2_score,
    })

scores_df = pd.DataFrame(scores)
print(scores_df.to_markdown(index=False))
```
[+] Long Sentence Similarity
```python
query = "الكتاب يتحدث عن أهمية الذكاء الاصطناعي في تطوير المجتمعات الحديثة"  # "The book discusses the importance of artificial intelligence in developing modern societies"
sentence_1 = "في هذا الكتاب، يناقش الكاتب كيف يمكن للتكنولوجيا أن تغير العالم"  # "In this book, the author discusses how technology can change the world"
sentence_2 = "الكاتب يتحدث عن أساليب الطبخ التقليدية في دول البحر الأبيض المتوسط"  # "The author talks about traditional cooking methods in Mediterranean countries"

scores = []
for dim in [768, 256, 48, 16, 8]:
    query_embedding = model.encode(query)[:dim]
    sent1_score = cos_sim(query_embedding, model.encode(sentence_1)[:dim])[0][0].tolist()
    sent2_score = cos_sim(query_embedding, model.encode(sentence_2)[:dim])[0][0].tolist()
    scores.append({
        "dim": dim,
        "valid_top": sent1_score > sent2_score,
        "sent1_score": sent1_score,
        "sent2_score": sent2_score,
    })

scores_df = pd.DataFrame(scores)
print(scores_df.to_markdown(index=False))
```
[+] Question to Paragraph Matching
```python
query = "ما هي فوائد ممارسة الرياضة؟"  # "What are the benefits of exercising?"
sentence_1 = "ممارسة الرياضة بشكل منتظم تساعد على تحسين الصحة العامة واللياقة البدنية"  # "Regular exercise helps improve overall health and physical fitness"
sentence_2 = "تعليم الأطفال في سن مبكرة يساعدهم على تطوير المهارات العقلية بسرعة"  # "Teaching children at an early age helps them develop mental skills quickly"

scores = []
for dim in [768, 256, 48, 16, 8]:
    query_embedding = model.encode(query)[:dim]
    sent1_score = cos_sim(query_embedding, model.encode(sentence_1)[:dim])[0][0].tolist()
    sent2_score = cos_sim(query_embedding, model.encode(sentence_2)[:dim])[0][0].tolist()
    scores.append({
        "dim": dim,
        "valid_top": sent1_score > sent2_score,
        "sent1_score": sent1_score,
        "sent2_score": sent2_score,
    })

scores_df = pd.DataFrame(scores)
print(scores_df.to_markdown(index=False))
```
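In a retrieval setting, the same truncation lets you store low-dimensional corpus embeddings and still rank documents against a query. A minimal NumPy sketch, with random placeholders standing in for `model.encode` output (`top_k` is a hypothetical helper, not part of the library):

```python
import numpy as np

def top_k(query_emb: np.ndarray, corpus_embs: np.ndarray, dim: int, k: int = 3):
    """Rank corpus rows by cosine similarity using only the first `dim` dimensions."""
    q = query_emb[:dim]
    c = corpus_embs[:, :dim]
    sims = (c @ q) / (np.linalg.norm(c, axis=1) * np.linalg.norm(q))
    order = np.argsort(-sims)[:k]
    return [(int(i), float(sims[i])) for i in order]

rng = np.random.default_rng(42)
corpus = rng.normal(size=(100, 768))            # stand-in for encoded documents
query = corpus[7] + 0.1 * rng.normal(size=768)  # query close to document 7

for dim in [768, 48, 8]:
    print(dim, top_k(query, corpus, dim))
```

Storing only the first 48 or 64 dimensions cuts index size roughly 12-16x; a common pattern is to retrieve candidates at low dimension and re-rank the short list with the full 768-dimensional embeddings.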
📚 Documentation
Model Information
| Property | Details |
|----------|---------|
| Base Model | aubmindlab/bert-base-arabertv02 |
| Library Name | sentence-transformers |
| Metrics | pearson_cosine, spearman_cosine, pearson_manhattan, spearman_manhattan, pearson_euclidean, spearman_euclidean, pearson_dot, spearman_dot, pearson_max, spearman_max |
| Pipeline Tag | sentence-similarity |
| Tags | sentence-transformers, sentence-similarity, feature-extraction, generated_from_trainer, loss:CosineSimilarityLoss, mteb |
Model Index
The model has been evaluated on multiple datasets; detailed results are listed below:
- MTEB MassiveIntentClassification:
- (ar) Test: Accuracy: 56.445864156018835, F1: 53.58282538318122, F1 Weighted: 56.821808211639315, Main Score: 56.445864156018835
- (en) Test: Accuracy: 47.40080699394754, F1: 44.729286773524755, F1 Weighted: 47.83506683571795, Main Score: 47.40080699394754
- (ar) Validation: Accuracy: 56.97983275946876, F1: 53.809263807080086, F1 Weighted: 57.14993215193604, Main Score: 56.97983275946876
- (en) Validation: Accuracy: 47.683226758485006, F1: 44.905317333393775, F1 Weighted: 48.051379514830195, Main Score: 47.683226758485006
- MTEB MassiveScenarioClassification:
- (ar) Test: Accuracy: 63.31876260928042, F1: 63.197056314678754, F1 Weighted: 62.7166315473092, Main Score: 63.31876260928042
- (en) Test: Accuracy: 53.35574983187627, F1: 50.35837223252574, F1 Weighted: 54.11644042208904, Main Score: 53.35574983187627
- (ar) Validation: Accuracy: 62.26758484997541, F1: 62.477928166560325, F1 Weighted: 61.92238394647396, Main Score: 62.26758484997541
- (en) Validation: Accuracy: 52.62174126906049, F1: 50.470501485026716, F1 Weighted: 53.16459392827557, Main Score: 52.62174126906049
- MTEB STS17:
- (en-en) Test: Cosine Pearson: 74.33941506827517, Cosine Spearman: 74.42197838273297, Euclidean Pearson: 75.33836191339782, Euclidean Spearman: 74.37385193453852, Main Score: 74.42197838273297
- (nl-en) Test: Cosine Pearson: 31.84872826199112, Cosine Spearman: 32.22496230755917, Euclidean Pearson: 21.830860533929688, Euclidean Spearman: 21.38205815348658, Main Score: 32.22496230755917
- (en-ar) Test: Cosine Pearson: 43.37529327788584, Cosine Spearman: 42.763149514327225, Euclidean Pearson: 39.625411905897394, Euclidean Spearman: 39.26727199746294, Main Score: 42.763149514327225
- (en-tr) Test: Cosine Pearson: 17.16722415938186, Cosine Spearman: 15.590330355526344, Euclidean Pearson: 4.430499555984906, Euclidean Spearman: 2.729050802084264, Main Score: 15.590330355526344
- (fr-en) Test: Cosine Pearson: 36.093945717347395, Cosine Spearman: 37.33997345407934, Euclidean Pearson: 23.156103022485055, Euclidean Spearman: 20.62925594786342, Main Score: 37.33997345407934
- (en-de) Test: Cosine Pearson: 29.064411455563, Cosine Spearman: 29.232781114344697, Euclidean Pearson: 16.90458086330736, Euclidean Spearman: 17.462020565289887, Main Score: 29.232781114344697
- (es-en) Test: Cosine Pearson: 27.686316587339473, Cosine Spearman: 28.650995973102205, Euclidean Pearson: 12.954885279630565, Euclidean Spearman: 11.970815927480198, Main Score: 28.650995973102205
- (ar-ar) Test: Cosine Pearson: 84.12612492708037, Cosine Spearman: 84.24703763883515, Euclidean Pearson: 81.38085140113648, Euclidean Spearman: 83.17403450502965, Main Score: 84.24703763883515
- (it-en) Test: Cosine Pearson: 27.697680546701868, Cosine Spearman: 25.19277336255784, Euclidean Pearson: 13.964798090314115, Euclidean Spearman: 10.512169361528596, Main Score: 25.19277336255784
- MTEB STS22.v2:
- (de-en) Test: Cosine Pearson: 32.87548760760924, Cosine Spearman: 30.69782036694315, Euclidean Pearson: 29.925045225262142, Euclidean Spearman: 34.076021250318334, Main Score: 30.69782036694315
- (zh-en) Test: Cosine Pearson: 23.93269292232737, Cosine Spearman: 16.781461291066496, Euclidean Pearson: 20.87679825681155, Euclidean Spearman: 13.764510796592536, Main Score: 16.781461291066496
- (ar) Test: Cosine Pearson: 51.73784691362425, Cosine Spearman: 60.01035490847343, Euclidean Pearson: 52.717195602630305, Euclidean Spearman: 60.22164097529916, Main Score: 60.01035490847343
- (es-en) Test: Cosine Pearson: 47.917244237624864, Cosine Spearman: 53.23173373821509, Euclidean Pearson: 48.172861539004636, Euclidean Spearman: 53.32970069145014, Main Score: 53.23173373821509
- (pl-en) Test: Cosine Pearson: 43.66748993183993, Cosine Spearman: 38.518248671828594, Euclidean Pearson: 50.475058499541134, Euclidean Spearman: 44.76070858743843, Main Score: 38.518248671828594
- (en) Test: Cosine Pearson: 56.41373213565263, Cosine Spearman: 59.03774516602592, Euclidean Pearson: 54.173092638047294, Euclidean Spearman: 59.130444355085885, Main Score: 59.03774516602592
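The Cosine Pearson/Spearman figures above are correlations between the model's predicted cosine similarities and human-annotated gold similarity labels: Pearson measures linear agreement, Spearman measures rank agreement. A minimal NumPy sketch with hypothetical scores (the `gold`/`pred` values are made up for illustration; real evaluations use the MTEB harness):

```python
import numpy as np

def pearson(x, y):
    """Pearson (linear) correlation coefficient."""
    return float(np.corrcoef(x, y)[0, 1])

def spearman(x, y):
    """Spearman (rank) correlation: Pearson computed on the ranks (no ties here)."""
    rank = lambda v: np.argsort(np.argsort(v))
    return pearson(rank(np.asarray(x)), rank(np.asarray(y)))

gold = [5.0, 4.2, 3.1, 2.0, 0.5]       # hypothetical human similarity labels
pred = [0.92, 0.85, 0.60, 0.41, 0.05]  # hypothetical model cosine scores

print(round(pearson(gold, pred), 4), round(spearman(gold, pred), 4))
```

Since `pred` here ranks the pairs in exactly the same order as `gold`, the Spearman correlation is 1.0 even though the two scales differ; this is why Spearman is the headline STS metric.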
📄 License
This project is licensed under the Apache 2.0 license.
Supported Languages