🚀 SHerbert large - Polish SentenceBERT
SentenceBERT is a modification of the pretrained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings, which can be compared via cosine similarity. The goal of this model is to produce embeddings that reflect the semantic and topical similarity of the input text. Training followed the original Siamese BERT setup for the semantic textual similarity (STS) task, with a minor adjustment in how the training data was used.
Semantic textual similarity measures how similar two pieces of text are in meaning.
For more information on how the model was prepared, check our blog post. The base model is Polish HerBERT, a BERT-based language model. For more details, refer to: "HerBERT: Efficiently Pretrained Transformer-based Language Model for Polish".
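For reference, the cosine similarity used to compare two sentence embeddings $u$ and $v$ is the standard measure:

$$\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$$

Scores close to 1 indicate sentences with very similar meaning, while lower scores indicate less related texts.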
🚀 Quick Start
The model is ready to use for sentence similarity tasks. You can follow the usage examples below to get started.
✨ Features
- Semantic Embeddings: Generates semantically meaningful sentence embeddings.
- Cosine-Similarity Comparison: Allows easy comparison of sentences via cosine similarity.
- Based on HerBERT: Utilizes the Polish HerBERT as the base model.
📦 Installation
No specific installation steps are provided in the original card; the usage example below only needs a few standard packages.
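As a minimal setup, assuming a standard pip environment (package names are inferred from the imports in the usage example; exact versions are not specified):

```bash
pip install transformers torch scikit-learn
```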
💻 Usage Examples
Basic Usage
```python
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.metrics import pairwise

sbert = AutoModel.from_pretrained("Voicelab/sbert-large-cased-pl")
tokenizer = AutoTokenizer.from_pretrained("Voicelab/sbert-large-cased-pl")

# Example sentences (Polish): s0 and s1 paraphrase each other; s2 is unrelated
# (it is about Kasparov's dispute with IBM over Deep Blue's game history).
s0 = "Uczenie maszynowe jest konsekwencją rozwoju idei sztucznej inteligencji i metod jej wdrażania praktycznego."
s1 = "Głębokie uczenie maszynowe jest skutkiem wdrażania praktycznego metod sztucznej inteligencji oraz jej rozwoju."
s2 = "Kasparow zarzucił firmie IBM oszustwo, kiedy odmówiła mu dostępu do historii wcześniejszych gier Deep Blue."

tokens = tokenizer([s0, s1, s2],
                   padding=True,
                   truncation=True,
                   return_tensors='pt')

# The pooler output serves as the sentence embedding; gradients are not needed for inference.
with torch.no_grad():
    x = sbert(tokens["input_ids"],
              tokens["attention_mask"]).pooler_output

# cosine_similarity expects 2D inputs, so keep the batch dimension when slicing.
print(pairwise.cosine_similarity(x[0:1], x[1:2]))  # similar pair -> higher score
print(pairwise.cosine_similarity(x[0:1], x[2:3]))  # unrelated pair -> lower score
```
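As a small follow-up (not part of the original card), the same embeddings can be compared all at once, continuing from the snippet above, which yields the full pairwise similarity matrix:

```python
# Continuing from the snippet above: `x` holds the three sentence embeddings.
sim_matrix = pairwise.cosine_similarity(x)
print(sim_matrix)  # sim_matrix[i, j] is the cosine similarity between sentences i and j
```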
📚 Documentation
Corpus
The model was trained solely on Wikipedia.
Tokenizer
As in the original HerBERT implementation, the training dataset was tokenized into subwords using a character-level byte-pair encoding (CharBPETokenizer) with a vocabulary size of 50k tokens. The tokenizer itself was trained with the tokenizers library. We recommend using the Fast version of the tokenizer, namely HerbertTokenizerFast.
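A minimal sketch of loading the recommended fast tokenizer, assuming the model repository ships the standard HerBERT tokenizer files:

```python
from transformers import HerbertTokenizerFast

# Load the fast (Rust-backed) tokenizer explicitly; AutoTokenizer would normally
# return a fast variant by default when the tokenizer files support it.
fast_tokenizer = HerbertTokenizerFast.from_pretrained("Voicelab/sbert-large-cased-pl")

# Inspect the CharBPE subword segmentation of a Polish sentence.
print(fast_tokenizer.tokenize("Uczenie maszynowe jest fascynujące."))
```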
Results
| Property | Details |
|----------|---------|
| Model Type | SHerbert large - Polish SentenceBERT |
| Training Data | Wikipedia |
| Model | Accuracy | Source |
|-------|----------|--------|
| SBERT-WikiSec-base (EN) | 80.42% | https://arxiv.org/abs/1908.10084 |
| SBERT-WikiSec-large (EN) | 80.78% | https://arxiv.org/abs/1908.10084 |
| sbert-base-cased-pl | 82.31% | https://huggingface.co/Voicelab/sbert-base-cased-pl |
| sbert-large-cased-pl | 84.42% | https://huggingface.co/Voicelab/sbert-large-cased-pl |
📄 License
CC BY 4.0
📖 Citation
If you use this model, please cite the following paper:
👥 Authors
The model was trained by the NLP Research Team at Voicelab.ai. You can contact us here.
