# Sambert - Embeddings Model for Hebrew
This is a sentence-transformers model that maps sentences and paragraphs to a 768-dimensional dense vector space, which can be used for tasks such as clustering or semantic search.
## Quick Start
### Features
- Maps sentences & paragraphs to a 768-dimensional dense vector space.
- Suitable for tasks like clustering or semantic search.
### Installation

Using this model is straightforward once you have sentence-transformers installed:

```bash
pip install -U sentence-transformers
```
### Usage Examples

#### Basic Usage
```python
from sentence_transformers import SentenceTransformer, util

# Hebrew example sentences
sentences = ["הכלב רץ בפארק", "כלב משחק בגינה", "אני אוהב לאכול פיצה"]

# Load the model and encode the sentences into 768-dimensional embeddings
model = SentenceTransformer('MPA/sambert')
embeddings = model.encode(sentences)

# Pairwise cosine similarities between the sentence embeddings
print(util.cos_sim(embeddings, embeddings))
```
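The same embeddings can drive the semantic-search use case mentioned in the features above. Below is a minimal sketch using `util.semantic_search`; the corpus and query strings are purely illustrative and not part of the original card:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('MPA/sambert')

# Illustrative Hebrew corpus and query (any Hebrew text works)
corpus = ["הכלב רץ בפארק", "אני אוהב לאכול פיצה", "מזג האוויר היום נעים"]
query = "כלב משחק בחוץ"

# Encode corpus and query, then retrieve the two most similar corpus entries
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)

for hit in hits[0]:
    print(corpus[hit['corpus_id']], hit['score'])
```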
#### Advanced Usage
Without sentence-transformers, you can use the model like this: first pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.
```python
from transformers import AutoTokenizer, AutoModel
import torch


# Mean pooling: average the token embeddings, taking the attention mask into account
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Hebrew example sentences
sentences = ["הכלב רץ בפארק", "כלב משחק בגינה", "אני אוהב לאכול פיצה"]

# Load the tokenizer and the underlying BERT model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained('MPA/sambert')
model = AutoModel.from_pretrained('MPA/sambert')

# Tokenize the sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Apply mean pooling to get one fixed-size vector per sentence
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
```
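Continuing from the snippet above, the pooled vectors can be sanity-checked by computing pairwise cosine similarities directly in PyTorch; this short sketch (the normalization step is an assumption, not part of the original card) should roughly match the `util.cos_sim` output from the basic example:

```python
import torch.nn.functional as F

# L2-normalize the pooled embeddings, then the dot product equals cosine similarity
normalized = F.normalize(sentence_embeddings, p=2, dim=1)
print(normalized @ normalized.T)
```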
## Documentation

### Evaluation Results
For an automated evaluation of this model, see the Sentence Embeddings Benchmark: https://seb.sbert.net
### Training

This model was trained in two stages:
- Unsupervised: ~2M paragraphs with `MultipleNegativesRankingLoss` on the CLS token
- Supervised: ~70k paragraphs with `CosineSimilarityLoss`
The model was trained with the following parameters:
DataLoader:

`torch.utils.data.dataloader.DataLoader` of length 11672 with parameters:

```python
{'batch_size': 4, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
```

Loss:

`sentence_transformers.losses.CosineSimilarityLoss.CosineSimilarityLoss`
Parameters of the fit() method:

```json
{
    "epochs": 1,
    "evaluation_steps": 1000,
    "evaluator": "sentence_transformers.evaluation.EmbeddingSimilarityEvaluator.EmbeddingSimilarityEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 500,
    "weight_decay": 0.01
}
```
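For orientation, the supervised stage likely corresponds to a standard sentence-transformers `fit()` call along the lines of the sketch below. This is a hedged reconstruction from the parameters listed above: the training pairs are purely illustrative, and the starting checkpoint shown here ('MPA/sambert') stands in for whatever stage-1 checkpoint was actually used.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Placeholder starting checkpoint; the real run presumably continued from the unsupervised stage-1 model
model = SentenceTransformer('MPA/sambert')

# Illustrative (sentence pair, similarity score) examples; the actual ~70k pairs are not published here
train_examples = [
    InputExample(texts=["הכלב רץ בפארק", "כלב משחק בגינה"], label=0.8),
    InputExample(texts=["הכלב רץ בפארק", "אני אוהב לאכול פיצה"], label=0.1),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=4)
train_loss = losses.CosineSimilarityLoss(model)

# Mirrors the fit() parameters listed above (evaluator omitted for brevity)
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=500,
    optimizer_params={'lr': 2e-05},
    weight_decay=0.01,
    max_grad_norm=1,
)
```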
### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False})
)
```
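Note that the Pooling module is configured for mean pooling (`pooling_mode_mean_tokens: True`), which is the same operation implemented by the `mean_pooling` function in the Advanced Usage example above.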
## Citing & Authors

Based on:

```bibtex
@misc{gueta2022large,
      title={Large Pre-Trained Models with Extra-Large Vocabularies: A Contrastive Analysis of Hebrew BERT Models and a New One to Outperform Them All},
      author={Eylon Gueta and Avi Shmidman and Shaltiel Shmidman and Cheyn Shmuel Shmidman and Joshua Guedalia and Moshe Koppel and Dan Bareket and Amit Seker and Reut Tsarfaty},
      year={2022},
      eprint={2211.15199},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```