🚀 SILMA Arabic Matryoshka Embedding Model 0.1
The SILMA Arabic Matryoshka Embedding Model 0.1 is an advanced Arabic text embedding model. It generates powerful, context-rich text representations, enabling a wide range of applications from semantic search to document classification. The model uses the Matryoshka Embedding technique, which lets a single embedding be truncated to different dimensions to balance speed, storage, and accuracy.
🚀 Quick Start
📦 Installation
First, install the Sentence Transformers library:
```bash
pip install -U sentence-transformers
```
Then load the model:
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
import pandas as pd

model_name = "silma-ai/silma-embeddding-matryoshka-0.1"
model = SentenceTransformer(model_name)
```
💻 Usage Examples
🔍 Basic Usage
With Matryoshka embeddings, you can keep only the first n dimensions of each text's representation. The samples below show how the chosen dimension affects the cosine similarity between a query and two candidate sentences. In most cases, even a very low dimension (e.g., 8) still produces acceptable semantic similarity scores.
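Before involving the model, the truncation itself can be sketched with plain NumPy. The vectors below are random placeholders standing in for real 768-dimensional embeddings; the point is only that cosine similarity on a prefix of the vector is well defined, since the similarity is normalized by the truncated norms:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
full = rng.normal(size=(2, 768))  # stand-ins for two 768-dim embeddings

for dim in [768, 256, 48, 16, 8]:
    a, b = full[0][:dim], full[1][:dim]  # keep only the first `dim` dimensions
    score = cosine(a, b)
    print(dim, round(score, 4))
```

With real Matryoshka-trained embeddings (unlike these random vectors), the low-dimensional prefixes are trained to preserve most of the semantic signal, which is why truncation works in the examples below.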
[+] Short Sentence Similarity
```python
query = "الطقس اليوم مشمس"  # "The weather today is sunny"
sentence_1 = "الجو اليوم كان مشمسًا ورائعًا"  # "The weather today was sunny and wonderful"
sentence_2 = "الطقس اليوم غائم"  # "The weather today is cloudy"

scores = []
for dim in [768, 256, 48, 16, 8]:
    query_embedding = model.encode(query)[:dim]
    sent1_score = cos_sim(query_embedding, model.encode(sentence_1)[:dim])[0][0].tolist()
    sent2_score = cos_sim(query_embedding, model.encode(sentence_2)[:dim])[0][0].tolist()
    scores.append({
        "dim": dim,
        "valid_top": sent1_score > sent2_score,
        "sent1_score": sent1_score,
        "sent2_score": sent2_score,
    })

scores_df = pd.DataFrame(scores)
print(scores_df.to_markdown(index=False))
```
[+] Long Sentence Similarity
```python
query = "الكتاب يتحدث عن أهمية الذكاء الاصطناعي في تطوير المجتمعات الحديثة"  # "The book discusses the importance of artificial intelligence in developing modern societies"
sentence_1 = "في هذا الكتاب، يناقش الكاتب كيف يمكن للتكنولوجيا أن تغير العالم"  # "In this book, the author discusses how technology can change the world"
sentence_2 = "الكاتب يتحدث عن أساليب الطبخ التقليدية في دول البحر الأبيض المتوسط"  # "The author talks about traditional cooking methods in Mediterranean countries"

scores = []
for dim in [768, 256, 48, 16, 8]:
    query_embedding = model.encode(query)[:dim]
    sent1_score = cos_sim(query_embedding, model.encode(sentence_1)[:dim])[0][0].tolist()
    sent2_score = cos_sim(query_embedding, model.encode(sentence_2)[:dim])[0][0].tolist()
    scores.append({
        "dim": dim,
        "valid_top": sent1_score > sent2_score,
        "sent1_score": sent1_score,
        "sent2_score": sent2_score,
    })

scores_df = pd.DataFrame(scores)
print(scores_df.to_markdown(index=False))
```
[+] Question to Paragraph Matching
```python
query = "ما هي فوائد ممارسة الرياضة؟"  # "What are the benefits of exercising?"
sentence_1 = "ممارسة الرياضة بشكل منتظم تساعد على تحسين الصحة العامة واللياقة البدنية"  # "Regular exercise helps improve overall health and physical fitness"
sentence_2 = "تعليم الأطفال في سن مبكرة يساعدهم على تطوير المهارات العقلية بسرعة"  # "Teaching children at an early age helps them develop mental skills quickly"

scores = []
for dim in [768, 256, 48, 16, 8]:
    query_embedding = model.encode(query)[:dim]
    sent1_score = cos_sim(query_embedding, model.encode(sentence_1)[:dim])[0][0].tolist()
    sent2_score = cos_sim(query_embedding, model.encode(sentence_2)[:dim])[0][0].tolist()
    scores.append({
        "dim": dim,
        "valid_top": sent1_score > sent2_score,
        "sent1_score": sent1_score,
        "sent2_score": sent2_score,
    })

scores_df = pd.DataFrame(scores)
print(scores_df.to_markdown(index=False))
```
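In a retrieval setting, the same truncation lets you store low-dimensional corpus embeddings and still rank documents against a query. A minimal NumPy sketch, with random placeholders standing in for `model.encode` output (`top_k` is a hypothetical helper, not part of the library):

```python
import numpy as np

def top_k(query_emb: np.ndarray, corpus_embs: np.ndarray, dim: int, k: int = 3):
    """Rank corpus rows by cosine similarity using only the first `dim` dimensions."""
    q = query_emb[:dim]
    c = corpus_embs[:, :dim]
    sims = (c @ q) / (np.linalg.norm(c, axis=1) * np.linalg.norm(q))
    order = np.argsort(-sims)[:k]
    return [(int(i), float(sims[i])) for i in order]

rng = np.random.default_rng(42)
corpus = rng.normal(size=(100, 768))            # stand-in for encoded documents
query = corpus[7] + 0.1 * rng.normal(size=768)  # query close to document 7

for dim in [768, 48, 8]:
    print(dim, top_k(query, corpus, dim))
```

Storing only the first 48 or 64 dimensions cuts index size roughly 12-16x; a common pattern is to retrieve candidates at low dimension and re-rank the short list with the full 768-dimensional embeddings.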
📚 Documentation
Model Information
| Property | Details |
|----------|---------|
| Base Model | aubmindlab/bert-base-arabertv02 |
| Library Name | sentence-transformers |
| Metrics | pearson_cosine, spearman_cosine, pearson_manhattan, spearman_manhattan, pearson_euclidean, spearman_euclidean, pearson_dot, spearman_dot, pearson_max, spearman_max |
| Pipeline Tag | sentence-similarity |
| Tags | sentence-transformers, sentence-similarity, feature-extraction, generated_from_trainer, loss:CosineSimilarityLoss, mteb |
Model Index
The model has been evaluated on multiple datasets; detailed results are listed below:
- MTEB MassiveIntentClassification:
- (ar) Test: Accuracy: 56.445864156018835, F1: 53.58282538318122, F1 Weighted: 56.821808211639315, Main Score: 56.445864156018835
- (en) Test: Accuracy: 47.40080699394754, F1: 44.729286773524755, F1 Weighted: 47.83506683571795, Main Score: 47.40080699394754
- (ar) Validation: Accuracy: 56.97983275946876, F1: 53.809263807080086, F1 Weighted: 57.14993215193604, Main Score: 56.97983275946876
- (en) Validation: Accuracy: 47.683226758485006, F1: 44.905317333393775, F1 Weighted: 48.051379514830195, Main Score: 47.683226758485006
- MTEB MassiveScenarioClassification:
- (ar) Test: Accuracy: 63.31876260928042, F1: 63.197056314678754, F1 Weighted: 62.7166315473092, Main Score: 63.31876260928042
- (en) Test: Accuracy: 53.35574983187627, F1: 50.35837223252574, F1 Weighted: 54.11644042208904, Main Score: 53.35574983187627
- (ar) Validation: Accuracy: 62.26758484997541, F1: 62.477928166560325, F1 Weighted: 61.92238394647396, Main Score: 62.26758484997541
- (en) Validation: Accuracy: 52.62174126906049, F1: 50.470501485026716, F1 Weighted: 53.16459392827557, Main Score: 52.62174126906049
- MTEB STS17:
- (en-en) Test: Cosine Pearson: 74.33941506827517, Cosine Spearman: 74.42197838273297, Euclidean Pearson: 75.33836191339782, Euclidean Spearman: 74.37385193453852, Main Score: 74.42197838273297
- (nl-en) Test: Cosine Pearson: 31.84872826199112, Cosine Spearman: 32.22496230755917, Euclidean Pearson: 21.830860533929688, Euclidean Spearman: 21.38205815348658, Main Score: 32.22496230755917
- (en-ar) Test: Cosine Pearson: 43.37529327788584, Cosine Spearman: 42.763149514327225, Euclidean Pearson: 39.625411905897394, Euclidean Spearman: 39.26727199746294, Main Score: 42.763149514327225
- (en-tr) Test: Cosine Pearson: 17.16722415938186, Cosine Spearman: 15.590330355526344, Euclidean Pearson: 4.430499555984906, Euclidean Spearman: 2.729050802084264, Main Score: 15.590330355526344
- (fr-en) Test: Cosine Pearson: 36.093945717347395, Cosine Spearman: 37.33997345407934, Euclidean Pearson: 23.156103022485055, Euclidean Spearman: 20.62925594786342, Main Score: 37.33997345407934
- (en-de) Test: Cosine Pearson: 29.064411455563, Cosine Spearman: 29.232781114344697, Euclidean Pearson: 16.90458086330736, Euclidean Spearman: 17.462020565289887, Main Score: 29.232781114344697
- (es-en) Test: Cosine Pearson: 27.686316587339473, Cosine Spearman: 28.650995973102205, Euclidean Pearson: 12.954885279630565, Euclidean Spearman: 11.970815927480198, Main Score: 28.650995973102205
- (ar-ar) Test: Cosine Pearson: 84.12612492708037, Cosine Spearman: 84.24703763883515, Euclidean Pearson: 81.38085140113648, Euclidean Spearman: 83.17403450502965, Main Score: 84.24703763883515
- (it-en) Test: Cosine Pearson: 27.697680546701868, Cosine Spearman: 25.19277336255784, Euclidean Pearson: 13.964798090314115, Euclidean Spearman: 10.512169361528596, Main Score: 25.19277336255784
- MTEB STS22.v2:
- (de-en) Test: Cosine Pearson: 32.87548760760924, Cosine Spearman: 30.69782036694315, Euclidean Pearson: 29.925045225262142, Euclidean Spearman: 34.076021250318334, Main Score: 30.69782036694315
- (zh-en) Test: Cosine Pearson: 23.93269292232737, Cosine Spearman: 16.781461291066496, Euclidean Pearson: 20.87679825681155, Euclidean Spearman: 13.764510796592536, Main Score: 16.781461291066496
- (ar) Test: Cosine Pearson: 51.73784691362425, Cosine Spearman: 60.01035490847343, Euclidean Pearson: 52.717195602630305, Euclidean Spearman: 60.22164097529916, Main Score: 60.01035490847343
- (es-en) Test: Cosine Pearson: 47.917244237624864, Cosine Spearman: 53.23173373821509, Euclidean Pearson: 48.172861539004636, Euclidean Spearman: 53.32970069145014, Main Score: 53.23173373821509
- (pl-en) Test: Cosine Pearson: 43.66748993183993, Cosine Spearman: 38.518248671828594, Euclidean Pearson: 50.475058499541134, Euclidean Spearman: 44.76070858743843, Main Score: 38.518248671828594
- (en) Test: Cosine Pearson: 56.41373213565263, Cosine Spearman: 59.03774516602592, Euclidean Pearson: 54.173092638047294, Euclidean Spearman: 59.130444355085885, Main Score: 59.03774516602592
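The Cosine Pearson/Spearman figures above are correlations between the model's predicted cosine similarities and human-annotated gold similarity labels: Pearson measures linear agreement, Spearman measures rank agreement. A minimal NumPy sketch with hypothetical scores (the `gold`/`pred` values are made up for illustration; real evaluations use the MTEB harness):

```python
import numpy as np

def pearson(x, y):
    """Pearson (linear) correlation coefficient."""
    return float(np.corrcoef(x, y)[0, 1])

def spearman(x, y):
    """Spearman (rank) correlation: Pearson computed on the ranks (no ties here)."""
    rank = lambda v: np.argsort(np.argsort(v))
    return pearson(rank(np.asarray(x)), rank(np.asarray(y)))

gold = [5.0, 4.2, 3.1, 2.0, 0.5]       # hypothetical human similarity labels
pred = [0.92, 0.85, 0.60, 0.41, 0.05]  # hypothetical model cosine scores

print(round(pearson(gold, pred), 4), round(spearman(gold, pred), 4))
```

Since `pred` here ranks the pairs in exactly the same order as `gold`, the Spearman correlation is 1.0 even though the two scales differ; this is why Spearman is the headline STS metric.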
📄 License
This project is licensed under the Apache 2.0 license.
Supported Languages