🚀 SA-BERT-V1: Saudi-Dialect Embeddings
SA-BERT-V1 offers high-quality sentence embeddings specifically tailored to the Saudi dialect, enabling effective semantic analysis and classification tasks.
🚀 Quick Start
The following example shows how to use SA-BERT-V1 to generate sentence embeddings.
```python
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_ID = "Omartificial-Intelligence-Space/SA-BERT-V1"
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Replace the placeholder with your Hugging Face read token if the repository requires authentication.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, token="PASS_READ_TOKEN_HERE")
model = AutoModel.from_pretrained(MODEL_ID, token="PASS_READ_TOKEN_HERE").to(DEVICE).eval()


def embed_sentence(text: str) -> torch.Tensor:
    """
    Tokenizes `text`, feeds it through SA-BERT-V1, and returns
    a 768-dimensional mean-pooled sentence embedding.
    """
    enc = tokenizer(
        text,
        truncation=True,
        padding="max_length",
        max_length=256,
        return_tensors="pt",
    ).to(DEVICE)

    with torch.no_grad():
        outputs = model(**enc).last_hidden_state  # (1, seq_len, 768)

    # Mean-pool token embeddings, ignoring padding positions.
    mask = enc["attention_mask"].unsqueeze(-1)    # (1, seq_len, 1)
    summed = (outputs * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    embedding = summed / counts
    return embedding.squeeze(0)                   # (768,)


if __name__ == "__main__":
    sentences = [
        "شتبي من البقالة؟",          # "What do you want from the grocery store?"
        "كيف حالك؟",                 # "How are you?"
        "وش رايك في الموضوع هذا؟",   # "What do you think about this topic?"
    ]
    for s in sentences:
        vec = embed_sentence(s)
        print(f"Sentence: {s}\nEmbedding shape: {vec.shape}\n")
```
✨ Features
SA-BERT-V1 delivers strong Saudi-dialect understanding, achieving a +0.0022 in-vs-cross similarity gap and mean cosine scores of about 0.98 across 44 specialized categories, setting a new standard for Arabic dialect sentence embeddings.
- Positive In-Cross Gap and High Similarity: SA-BERT-V1 shows a positive in-cross gap together with high absolute similarity, demonstrating the effectiveness of targeted Saudi-dialect fine-tuning.
- Exceptional Performance: Both in-category and cross-category similarities sit near 0.98, with a slight positive gap (+0.0023), meaning same-topic embeddings are closer together. The model clusters Saudi-dialect text well and is a good fit for retrieval and grouping tasks (see the similarity sketch below).
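To make the semantic-similarity use case concrete, the sketch below compares a paraphrase pair against an unrelated pair using cosine similarity. It assumes the `embed_sentence` helper defined in the Quick Start above; the sentences themselves are illustrative examples only, not part of the evaluation set.

```python
import torch.nn.functional as F

# Reuses embed_sentence() from the Quick Start section.
sent_a = "وش رايك في الموضوع هذا؟"   # "What do you think about this topic?"
sent_b = "ايش رايك بهالموضوع؟"        # paraphrase of the same question
sent_c = "كيف الجو اليوم؟"            # "How is the weather today?"

emb_a, emb_b, emb_c = (embed_sentence(s) for s in (sent_a, sent_b, sent_c))

# Cosine similarity between mean-pooled sentence embeddings.
sim_ab = F.cosine_similarity(emb_a.unsqueeze(0), emb_b.unsqueeze(0)).item()
sim_ac = F.cosine_similarity(emb_a.unsqueeze(0), emb_c.unsqueeze(0)).item()

print(f"Paraphrase pair: {sim_ab:.4f}")   # expected to score higher
print(f"Unrelated pair:  {sim_ac:.4f}")
```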
📚 Documentation
Model Details
| Property | Details |
|----------|---------|
| Fine-Tuned Model ID | Omartificial-Intelligence-Space/SA-BERT-V1 |
| License | Apache 2.0 |
| Designed For | Saudi dialect |
| Model Type | Sentence embedding (BERT encoder with mean pooling) |
| Architecture | 12-layer Transformer, 768-dim hidden states |
| Embedding Size | 768 |
| Pretrained On | UBC-NLP/MARBERTv2 |
| Fine-Tuned On | Over 500K Saudi-dialect sentences covering diverse topics and regional variations (Hijazi, Najdi, and more) |
| Supported Language | Arabic (Saudi dialect) |
| Intended Tasks | Semantic similarity, clustering, retrieval, downstream classification |
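Since retrieval is listed among the intended tasks, here is a minimal sketch of nearest-neighbour search over a small corpus. It again assumes the `embed_sentence` helper from the Quick Start; the corpus and query are illustrative placeholders rather than a recommended setup.

```python
import torch
import torch.nn.functional as F

# Illustrative corpus; in practice this would be your own Saudi-dialect documents.
corpus = [
    "وش الطلبات من البقالة؟",        # "What do you need from the grocery store?"
    "اليوم الجو حار مرة في الرياض",   # "It is very hot in Riyadh today."
    "متى يفتح المحل الصباح؟",         # "When does the shop open in the morning?"
]
query = "شتبي من البقالة؟"

# Embed and L2-normalize so that a dot product equals cosine similarity.
corpus_embs = F.normalize(torch.stack([embed_sentence(s) for s in corpus]), dim=-1)
query_emb = F.normalize(embed_sentence(query), dim=-1)

# Rank corpus sentences by cosine similarity to the query.
scores = corpus_embs @ query_emb
for idx in scores.argsort(descending=True).tolist():
    print(f"{scores[idx].item():.4f}  {corpus[idx]}")
```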
Evaluation Details
- The evaluations, covering both the similarity metrics and the in-vs-cross gap plots, were run on a held-out test set of 1,280 Saudi-dialect sentences spanning 44 diverse categories (e.g., Greetings, Weather, Law & Justice).
- Dataset: The evaluation set was created and released by Omartificial-Intelligence-Space to benchmark embedding models; intra-category and cross-category pairs are sampled from it to compute the following (a minimal sketch of this computation appears after this list):
  - Average in-category / cross-category cosine similarities
  - Top-5 most/least similar pairs
  - Per-category average similarities
- Access Test Samples: [saudi-dialect-test-samples](https://huggingface.co/datasets/Omartificial-Intelligence-Space/saudi-dialect-test-samples)
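For illustration, here is a minimal sketch of how in-category and cross-category average similarities can be computed from (sentence, category) pairs. It assumes the `embed_sentence` helper from the Quick Start and scores all pairs exhaustively, which may differ from the exact sampling used in the released evaluation.

```python
import itertools
import torch
import torch.nn.functional as F

def in_vs_cross_similarity(samples):
    """Average cosine similarity over same-category and different-category pairs.

    `samples` is a list of (sentence, category) tuples.
    """
    texts, cats = zip(*samples)
    embs = F.normalize(torch.stack([embed_sentence(t) for t in texts]), dim=-1)

    in_scores, cross_scores = [], []
    for i, j in itertools.combinations(range(len(samples)), 2):
        score = float(embs[i] @ embs[j])
        (in_scores if cats[i] == cats[j] else cross_scores).append(score)

    return sum(in_scores) / len(in_scores), sum(cross_scores) / len(cross_scores)

# Toy example with two categories; the released evaluation uses 1,280 sentences in 44 categories.
samples = [
    ("كيف حالك؟", "Greetings"),
    ("هلا والله، وش أخبارك؟", "Greetings"),
    ("اليوم الجو غبار في الرياض", "Weather"),
    ("الحين الدنيا ممطرة عندنا", "Weather"),
]
in_avg, cross_avg = in_vs_cross_similarity(samples)
print(f"in-category: {in_avg:.4f}, cross-category: {cross_avg:.4f}, gap: {in_avg - cross_avg:+.4f}")
```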
📄 License
This model is licensed under the Apache 2.0 license.
📚 Citation
If you use SA-BERT-V1 in your research or applications, please cite:
```bibtex
@misc{nacar2025SABERTV1,
  title        = {SA-BERT-V1: Fine-Tuned Saudi-Dialect Embeddings},
  author       = {Nacar, Omer and Sibaee, Serry},
  year         = {2025},
  publisher    = {Omartificial-Intelligence-Space},
  howpublished = {\url{https://huggingface.co/Omartificial-Intelligence-Space/SA-BERT-V1}},
}

@inproceedings{abdul-mageed-etal-2021-arbert,
  title     = "{ARBERT} {\&} {MARBERT}: Deep Bidirectional Transformers for {A}rabic",
  author    = "Abdul-Mageed, Muhammad and Elmadany, AbdelRahim and Nagoudi, El Moatez Billah",
  booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
  year      = "2021",
  publisher = "Association for Computational Linguistics",
  pages     = "7088--7105",
}
```