🚀 imvladikon/sentence-transformers-alephbert [WIP]
This is a sentence-transformers model for Hebrew: it maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks such as clustering or semantic search. The current version is a distillation of the LaBSE model on a private corpus.
🚀 Quick Start
📦 Installation
Using this model is straightforward once you have sentence-transformers installed:

```bash
pip install -U sentence-transformers
```
💻 Usage Examples
Basic Usage
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Two Hebrew paraphrases:
# "They were happy to see the event that took place." /
# "Seeing the event that took place made them very happy."
sentences = [
    "הם היו שמחים לראות את האירוע שהתקיים.",
    "לראות את האירוע שהתקיים היה מאוד משמח להם.",
]

model = SentenceTransformer('imvladikon/sentence-transformers-alephbert')
embeddings = model.encode(sentences)

# Cosine similarity between the two sentence embeddings
print(cos_sim(embeddings[0], embeddings[1]).item())
```
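Since semantic search is one of the use cases mentioned above, the sketch below shows one way to do it with sentence_transformers.util.semantic_search; it reuses the two example sentences as a toy corpus, with the first sentence standing in for a query.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('imvladikon/sentence-transformers-alephbert')

# Toy corpus: the two Hebrew sentences from the example above.
corpus = [
    "הם היו שמחים לראות את האירוע שהתקיים.",
    "לראות את האירוע שהתקיים היה מאוד משמח להם.",
]

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode("הם היו שמחים לראות את האירוע שהתקיים.", convert_to_tensor=True)

# Rank corpus entries by cosine similarity to the query
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(corpus[hit['corpus_id']], round(hit['score'], 3))
```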
Advanced Usage
Without sentence-transformers, you can use the model directly with transformers: first pass your input through the transformer model, then apply the correct pooling operation (mean pooling, as in the architecture below) on top of the contextualized word embeddings.
```python
import torch
from torch import nn
from transformers import AutoTokenizer, AutoModel


def mean_pooling(model_output, attention_mask):
    # Average the token embeddings, ignoring padding tokens via the attention mask
    token_embeddings = model_output[0]  # first element contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


sentences = [
    "הם היו שמחים לראות את האירוע שהתקיים.",
    "לראות את האירוע שהתקיים היה מאוד משמח להם.",
]

tokenizer = AutoTokenizer.from_pretrained('imvladikon/sentence-transformers-alephbert')
model = AutoModel.from_pretrained('imvladikon/sentence-transformers-alephbert')

encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    model_output = model(**encoded_input)

# Mean pooling over the token embeddings yields one vector per sentence
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

cos_sim = nn.CosineSimilarity(dim=0, eps=1e-6)
print(cos_sim(sentence_embeddings[0], sentence_embeddings[1]).item())
```
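If both snippets are run in the same session, the manually pooled embeddings should closely match the library's output, since the SentenceTransformer pipeline (see the architecture below) also applies mean pooling with no extra normalization module. A quick sanity check, assuming `sentences` and `sentence_embeddings` from the snippet above are still in scope:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

st_model = SentenceTransformer('imvladikon/sentence-transformers-alephbert')
st_embeddings = st_model.encode(sentences)  # `sentences` from the snippet above

# Compare against the manually mean-pooled embeddings (tolerance is an arbitrary choice)
print(np.allclose(st_embeddings, sentence_embeddings.numpy(), atol=1e-4))
```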
📚 Documentation
Evaluation Results
For an automated evaluation of this model, see the Sentence Embeddings Benchmark: https://seb.sbert.net
Training
The model was trained with the following parameters:
DataLoader:
torch.utils.data.dataloader.DataLoader of length 44999 with parameters:
```
{'batch_size': 8, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
```
Loss:
sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss with parameters:
```
{'scale': 20.0, 'similarity_fct': 'cos_sim'}
```
Parameters of the fit() method:
```
{
    "epochs": 10,
    "evaluation_steps": 0,
    "evaluator": "NoneType",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 44999,
    "weight_decay": 0.01
}
```
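For orientation, the sketch below shows how a run with these parameters could be set up using the classic sentence-transformers fit() API and MultipleNegativesRankingLoss. It is only an illustration: the training corpus is private, so the sentence pairs and the starting checkpoint here are placeholders, not the actual recipe used for this model.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, util

# Placeholder pairs: each example holds an anchor sentence and a positive (paraphrase/translation).
train_pairs = [("sentence a", "paraphrase of sentence a")]  # hypothetical data
train_examples = [InputExample(texts=[a, b]) for a, b in train_pairs]

# Placeholder starting checkpoint; the actual base model is not specified on this card.
model = SentenceTransformer('imvladikon/sentence-transformers-alephbert')

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=8)
train_loss = losses.MultipleNegativesRankingLoss(model, scale=20.0, similarity_fct=util.cos_sim)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=10,
    scheduler='WarmupLinear',
    warmup_steps=44999,
    optimizer_params={'lr': 2e-5},
    weight_decay=0.01,
    max_grad_norm=1,
)
```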
Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
```
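You can confirm the architecture, embedding dimension, and maximum sequence length after loading the model; a short sketch:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('imvladikon/sentence-transformers-alephbert')
print(model)                                      # prints the module list shown above
print(model.get_sentence_embedding_dimension())   # 768
print(model.max_seq_length)                       # 512
```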
Citing & Authors
```bibtex
@misc{seker2021alephberta,
  title={AlephBERT: A Hebrew Large Pre-Trained Language Model to Start-off your Hebrew NLP Application With},
  author={Amit Seker and Elron Bandel and Dan Bareket and Idan Brusilovsky and Refael Shaked Greenfeld and Reut Tsarfaty},
  year={2021},
  eprint={2104.04052},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

```bibtex
@misc{reimers2019sentencebert,
  title={Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks},
  author={Nils Reimers and Iryna Gurevych},
  year={2019},
  eprint={1908.10084},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```