🚀 SHerbert - Polish SentenceBERT
SentenceBERT is a modification of the pretrained BERT network. It uses siamese and triplet network structures to derive semantically meaningful sentence embeddings, which can be compared using cosine - similarity. The model aims to generate different embeddings based on the semantic and topic similarity of the given text.
Semantic textual similarity analyzes how similar two pieces of texts are.
Read more about how the model was prepared in our blog post.
🚀 Quick Start
SentenceBERT modifies the pretrained BERT network. It employs siamese and triplet network structures to obtain semantically significant sentence embeddings, enabling comparison via cosine - similarity. Training was based on the original paper Siamese BERT models for the task of semantic textual similarity (STS), with a minor adjustment in how the training data was utilized.
The base trained model is a Polish HerBERT, which is a BERT - based Language Model. For more details, please refer to: "HerBERT: Efficiently Pretrained Transformer - based Language Model for Polish".
✨ Features
- Semantic Embeddings: Generates semantically meaningful sentence embeddings for text similarity analysis.
- Based on HerBERT: Utilizes the Polish HerBERT as the base model.
- Trained on Wikipedia: The model was trained solely on Wikipedia.
📦 Installation
No specific installation steps are provided in the original document.
💻 Usage Examples
Basic Usage
from transformers import AutoTokenizer, AutoModel
from sklearn.metrics import pairwise
sbert = AutoModel.from_pretrained("Voicelab/sbert-base-cased-pl")
tokenizer = AutoTokenizer.from_pretrained("Voicelab/sbert-base-cased-pl")
s0 = "Uczenie maszynowe jest konsekwencją rozwoju idei sztucznej inteligencji i metod jej wdrażania praktycznego."
s1 = "Głębokie uczenie maszynowe jest sktukiem wdrażania praktycznego metod sztucznej inteligencji oraz jej rozwoju."
s2 = "Kasparow zarzucił firmie IBM oszustwo, kiedy odmówiła mu dostępu do historii wcześniejszych gier Deep Blue. "
tokens = tokenizer([s0, s1, s2],
padding=True,
truncation=True,
return_tensors='pt')
x = sbert(tokens["input_ids"],
tokens["attention_mask"]).pooler_output
print(pairwise.cosine_similarity(x[0], x[1]))
print(pairwise.cosine_similarity(x[0], x[2]))
📚 Documentation
Corpus
The model was trained solely on Wikipedia.
Tokenizer
As in the original HerBERT implementation, the training dataset was tokenized into subwords using a character level byte - pair encoding (CharBPETokenizer) with a vocabulary size of 50k tokens. The tokenizer itself was trained with a tokenizers library.
We kindly encourage you to use the Fast version of the tokenizer, namely HerbertTokenizerFast.
Results
Property |
Details |
SBERT - WikiSec - base (EN) |
Accuracy: 80.42%, Source: https://arxiv.org/abs/1908.10084 |
SBERT - WikiSec - large (EN) |
Accuracy: 80.78%, Source: https://arxiv.org/abs/1908.10084 |
sbert - base - cased - pl |
Accuracy: 82.31%, Source: https://huggingface.co/Voicelab/sbert - base - cased - pl |
sbert - large - cased - pl |
Accuracy: 84.42%, Source: https://huggingface.co/Voicelab/sbert - large - cased - pl |
📄 License
CC BY 4.0
📚 Citation
If you use this model, please cite the following paper:
👥 Authors
The model was trained by NLP Research Team at Voicelab.ai.
You can contact us here.