🚀 SA-BERT-V1: Saudi-Dialect Embeddings
SA-BERT-V1 offers high-quality sentence embeddings specifically tailored to the Saudi dialect, enabling effective semantic analysis and classification tasks.
🚀 Quick Start
The following example shows how to use SA-BERT-V1 to generate sentence embeddings.
```python
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_ID = "Omartificial-Intelligence-Space/SA-BERT-V1"
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Replace the placeholder with your Hugging Face read token if the repository requires authentication.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, token="PASS_READ_TOKEN_HERE")
model = AutoModel.from_pretrained(MODEL_ID, token="PASS_READ_TOKEN_HERE").to(DEVICE).eval()


def embed_sentence(text: str) -> torch.Tensor:
    """
    Tokenizes `text`, feeds it through SA-BERT-V1, and returns
    a 768-dimensional mean-pooled sentence embedding.
    """
    enc = tokenizer(
        text,
        truncation=True,
        padding="max_length",
        max_length=256,
        return_tensors="pt",
    ).to(DEVICE)

    with torch.no_grad():
        outputs = model(**enc).last_hidden_state  # (1, seq_len, 768)

    # Mean-pool token embeddings, ignoring padding positions.
    mask = enc["attention_mask"].unsqueeze(-1)    # (1, seq_len, 1)
    summed = (outputs * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    embedding = summed / counts
    return embedding.squeeze(0)                   # (768,)


if __name__ == "__main__":
    sentences = [
        "شتبي من البقالة؟",          # "What do you want from the grocery store?"
        "كيف حالك؟",                 # "How are you?"
        "وش رايك في الموضوع هذا؟",   # "What do you think about this topic?"
    ]
    for s in sentences:
        vec = embed_sentence(s)
        print(f"Sentence: {s}\nEmbedding shape: {vec.shape}\n")
```
✨ Features
SA-BERT-V1 delivers strong Saudi-dialect understanding, achieving a +0.0022 in-vs-cross similarity gap and mean cosine scores of about 0.98 across 44 specialized categories, setting a new standard for Arabic dialect sentence embeddings.
- Positive In-Cross Gap and High Similarity: SA-BERT-V1 shows a positive in-cross gap together with high absolute similarity, demonstrating the effectiveness of targeted Saudi-dialect fine-tuning.
- Exceptional Performance: Both in-category and cross-category similarities sit near 0.98, with a slight positive gap (+0.0023), meaning same-topic embeddings are closer together. The model clusters Saudi-dialect text well and is a good fit for retrieval and grouping tasks (see the similarity sketch below).
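To make the semantic-similarity use case concrete, the sketch below compares a paraphrase pair against an unrelated pair using cosine similarity. It assumes the `embed_sentence` helper defined in the Quick Start above; the sentences themselves are illustrative examples only, not part of the evaluation set.

```python
import torch.nn.functional as F

# Reuses embed_sentence() from the Quick Start section.
sent_a = "وش رايك في الموضوع هذا؟"   # "What do you think about this topic?"
sent_b = "ايش رايك بهالموضوع؟"        # paraphrase of the same question
sent_c = "كيف الجو اليوم؟"            # "How is the weather today?"

emb_a, emb_b, emb_c = (embed_sentence(s) for s in (sent_a, sent_b, sent_c))

# Cosine similarity between mean-pooled sentence embeddings.
sim_ab = F.cosine_similarity(emb_a.unsqueeze(0), emb_b.unsqueeze(0)).item()
sim_ac = F.cosine_similarity(emb_a.unsqueeze(0), emb_c.unsqueeze(0)).item()

print(f"Paraphrase pair: {sim_ab:.4f}")   # expected to score higher
print(f"Unrelated pair:  {sim_ac:.4f}")
```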
📚 Documentation
Model Details
| Property | Details |
|----------|---------|
| Fine-Tuned Model ID | Omartificial-Intelligence-Space/SA-BERT-V1 |
| License | Apache 2.0 |
| Designed For | Saudi dialect |
| Model Type | Sentence embedding (BERT encoder with mean pooling) |
| Architecture | 12-layer Transformer, 768-dim hidden states |
| Embedding Size | 768 |
| Pretrained On | UBC-NLP/MARBERTv2 |
| Fine-Tuned On | Over 500K Saudi-dialect sentences covering diverse topics and regional variations (Hijazi, Najdi, and more) |
| Supported Language | Arabic (Saudi dialect) |
| Intended Tasks | Semantic similarity, clustering, retrieval, downstream classification |
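Since retrieval is listed among the intended tasks, here is a minimal sketch of nearest-neighbour search over a small corpus. It again assumes the `embed_sentence` helper from the Quick Start; the corpus and query are illustrative placeholders rather than a recommended setup.

```python
import torch
import torch.nn.functional as F

# Illustrative corpus; in practice this would be your own Saudi-dialect documents.
corpus = [
    "وش الطلبات من البقالة؟",        # "What do you need from the grocery store?"
    "اليوم الجو حار مرة في الرياض",   # "It is very hot in Riyadh today."
    "متى يفتح المحل الصباح؟",         # "When does the shop open in the morning?"
]
query = "شتبي من البقالة؟"

# Embed and L2-normalize so that a dot product equals cosine similarity.
corpus_embs = F.normalize(torch.stack([embed_sentence(s) for s in corpus]), dim=-1)
query_emb = F.normalize(embed_sentence(query), dim=-1)

# Rank corpus sentences by cosine similarity to the query.
scores = corpus_embs @ query_emb
for idx in scores.argsort(descending=True).tolist():
    print(f"{scores[idx].item():.4f}  {corpus[idx]}")
```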
Evaluation Details
- The evaluations, covering both the similarity metrics and the in-vs-cross gap plots, were run on a held-out test set of 1,280 Saudi-dialect sentences spanning 44 diverse categories (e.g., Greetings, Weather, Law & Justice).
- Dataset: The evaluation set was created and released by Omartificial-Intelligence-Space to benchmark embedding models; intra-category and cross-category pairs are sampled from it to compute the following (a minimal sketch of this computation appears after this list):
  - Average in-category / cross-category cosine similarities
  - Top-5 most/least similar pairs
  - Per-category average similarities
- Access Test Samples: [saudi-dialect-test-samples](https://huggingface.co/datasets/Omartificial-Intelligence-Space/saudi-dialect-test-samples)
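For illustration, here is a minimal sketch of how in-category and cross-category average similarities can be computed from (sentence, category) pairs. It assumes the `embed_sentence` helper from the Quick Start and scores all pairs exhaustively, which may differ from the exact sampling used in the released evaluation.

```python
import itertools
import torch
import torch.nn.functional as F

def in_vs_cross_similarity(samples):
    """Average cosine similarity over same-category and different-category pairs.

    `samples` is a list of (sentence, category) tuples.
    """
    texts, cats = zip(*samples)
    embs = F.normalize(torch.stack([embed_sentence(t) for t in texts]), dim=-1)

    in_scores, cross_scores = [], []
    for i, j in itertools.combinations(range(len(samples)), 2):
        score = float(embs[i] @ embs[j])
        (in_scores if cats[i] == cats[j] else cross_scores).append(score)

    return sum(in_scores) / len(in_scores), sum(cross_scores) / len(cross_scores)

# Toy example with two categories; the released evaluation uses 1,280 sentences in 44 categories.
samples = [
    ("كيف حالك؟", "Greetings"),
    ("هلا والله، وش أخبارك؟", "Greetings"),
    ("اليوم الجو غبار في الرياض", "Weather"),
    ("الحين الدنيا ممطرة عندنا", "Weather"),
]
in_avg, cross_avg = in_vs_cross_similarity(samples)
print(f"in-category: {in_avg:.4f}, cross-category: {cross_avg:.4f}, gap: {in_avg - cross_avg:+.4f}")
```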
📄 License
This model is licensed under the Apache 2.0 license.
📚 Citation
If you use SA-BERT-V1 in your research or applications, please cite:
```bibtex
@misc{nacar2025SABERTV1,
  title        = {SA-BERT-V1: Fine-Tuned Saudi-Dialect Embeddings},
  author       = {Nacar, Omer and Sibaee, Serry},
  year         = {2025},
  publisher    = {Omartificial-Intelligence-Space},
  howpublished = {\url{https://huggingface.co/Omartificial-Intelligence-Space/SA-BERT-V1}},
}

@inproceedings{abdul-mageed-etal-2021-arbert,
  title     = "{ARBERT} {\&} {MARBERT}: Deep Bidirectional Transformers for {A}rabic",
  author    = "Abdul-Mageed, Muhammad and Elmadany, AbdelRahim and Nagoudi, El Moatez Billah",
  booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
  year      = "2021",
  publisher = "Association for Computational Linguistics",
  pages     = "7088--7105",
}
```