🚀 imvladikon/sentence-transformers-alephbert [WIP]
This is a sentence-transformers model for Hebrew: it maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks such as clustering or semantic search. The current version is a distillation of the LaBSE model on a private corpus.
🚀 Quick Start
📦 Installation
Using this model is straightforward once you have sentence-transformers installed:

```bash
pip install -U sentence-transformers
```
💻 Usage Examples
Basic Usage
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Two Hebrew paraphrases:
# "They were happy to see the event that took place." /
# "Seeing the event that took place made them very happy."
sentences = [
    "הם היו שמחים לראות את האירוע שהתקיים.",
    "לראות את האירוע שהתקיים היה מאוד משמח להם.",
]

model = SentenceTransformer('imvladikon/sentence-transformers-alephbert')
embeddings = model.encode(sentences)

# Cosine similarity between the two sentence embeddings
print(cos_sim(embeddings[0], embeddings[1]).item())
```
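Since semantic search is one of the use cases mentioned above, the sketch below shows one way to do it with sentence_transformers.util.semantic_search; it reuses the two example sentences as a toy corpus, with the first sentence standing in for a query.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('imvladikon/sentence-transformers-alephbert')

# Toy corpus: the two Hebrew sentences from the example above.
corpus = [
    "הם היו שמחים לראות את האירוע שהתקיים.",
    "לראות את האירוע שהתקיים היה מאוד משמח להם.",
]

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode("הם היו שמחים לראות את האירוע שהתקיים.", convert_to_tensor=True)

# Rank corpus entries by cosine similarity to the query
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(corpus[hit['corpus_id']], round(hit['score'], 3))
```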
Advanced Usage
Without sentence-transformers, you can use the model directly with transformers: first pass your input through the transformer model, then apply the correct pooling operation (mean pooling, as in the architecture below) on top of the contextualized word embeddings.
```python
import torch
from torch import nn
from transformers import AutoTokenizer, AutoModel


def mean_pooling(model_output, attention_mask):
    # Average the token embeddings, ignoring padding tokens via the attention mask
    token_embeddings = model_output[0]  # first element contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


sentences = [
    "הם היו שמחים לראות את האירוע שהתקיים.",
    "לראות את האירוע שהתקיים היה מאוד משמח להם.",
]

tokenizer = AutoTokenizer.from_pretrained('imvladikon/sentence-transformers-alephbert')
model = AutoModel.from_pretrained('imvladikon/sentence-transformers-alephbert')

encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    model_output = model(**encoded_input)

# Mean pooling over the token embeddings yields one vector per sentence
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

cos_sim = nn.CosineSimilarity(dim=0, eps=1e-6)
print(cos_sim(sentence_embeddings[0], sentence_embeddings[1]).item())
```
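If both snippets are run in the same session, the manually pooled embeddings should closely match the library's output, since the SentenceTransformer pipeline (see the architecture below) also applies mean pooling with no extra normalization module. A quick sanity check, assuming `sentences` and `sentence_embeddings` from the snippet above are still in scope:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

st_model = SentenceTransformer('imvladikon/sentence-transformers-alephbert')
st_embeddings = st_model.encode(sentences)  # `sentences` from the snippet above

# Compare against the manually mean-pooled embeddings (tolerance is an arbitrary choice)
print(np.allclose(st_embeddings, sentence_embeddings.numpy(), atol=1e-4))
```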
📚 Documentation
Evaluation Results
For an automated evaluation of this model, see the Sentence Embeddings Benchmark: https://seb.sbert.net
Training
The model was trained with the following parameters:
DataLoader:
torch.utils.data.dataloader.DataLoader of length 44999 with parameters:
```
{'batch_size': 8, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
```
Loss:
sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss with parameters:
```
{'scale': 20.0, 'similarity_fct': 'cos_sim'}
```
Parameters of the fit() method:
```
{
    "epochs": 10,
    "evaluation_steps": 0,
    "evaluator": "NoneType",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 44999,
    "weight_decay": 0.01
}
```
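For orientation, the sketch below shows how a run with these parameters could be set up using the classic sentence-transformers fit() API and MultipleNegativesRankingLoss. It is only an illustration: the training corpus is private, so the sentence pairs and the starting checkpoint here are placeholders, not the actual recipe used for this model.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, util

# Placeholder pairs: each example holds an anchor sentence and a positive (paraphrase/translation).
train_pairs = [("sentence a", "paraphrase of sentence a")]  # hypothetical data
train_examples = [InputExample(texts=[a, b]) for a, b in train_pairs]

# Placeholder starting checkpoint; the actual base model is not specified on this card.
model = SentenceTransformer('imvladikon/sentence-transformers-alephbert')

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=8)
train_loss = losses.MultipleNegativesRankingLoss(model, scale=20.0, similarity_fct=util.cos_sim)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=10,
    scheduler='WarmupLinear',
    warmup_steps=44999,
    optimizer_params={'lr': 2e-5},
    weight_decay=0.01,
    max_grad_norm=1,
)
```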
Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
```
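You can confirm the architecture, embedding dimension, and maximum sequence length after loading the model; a short sketch:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('imvladikon/sentence-transformers-alephbert')
print(model)                                      # prints the module list shown above
print(model.get_sentence_embedding_dimension())   # 768
print(model.max_seq_length)                       # 512
```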
Citing & Authors
```bibtex
@misc{seker2021alephberta,
  title={AlephBERT: A Hebrew Large Pre-Trained Language Model to Start-off your Hebrew NLP Application With},
  author={Amit Seker and Elron Bandel and Dan Bareket and Idan Brusilovsky and Refael Shaked Greenfeld and Reut Tsarfaty},
  year={2021},
  eprint={2104.04052},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

```bibtex
@misc{reimers2019sentencebert,
  title={Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks},
  author={Nils Reimers and Iryna Gurevych},
  year={2019},
  eprint={1908.10084},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```