Open-source sbert-base-cased-pl model - Generate sentence embeddings for Polish and compare sentence similarities

Sbert Base Cased Pl

Developed by Voicelab

SHerbert is a SentenceBERT implementation based on the Polish HerBERT model, designed to generate semantically meaningful sentence embeddings, supporting sentence similarity comparison via cosine similarity.

Text Embedding

PyTorch

Other

#Polish sentence semantic similarity #Wikipedia pre-training #Siamese network architecture

Downloads 1,606

Release Time : 4/11/2022

Model Overview

This model is a modification of the pre-trained BERT network that uses siamese and triplet network structures to generate sentence embeddings, aimed at semantic textual similarity tasks.

Model Features

Polish language optimization

Built on the Polish HerBERT model and suited to processing Polish text.

Semantic similarity calculation

Capable of generating semantically meaningful sentence embeddings, supporting sentence similarity comparison via cosine similarity.

Efficient training

Trained exclusively on Wikipedia data.

Model Capabilities

Sentence embedding generation

Semantic similarity calculation

Polish text processing

Use Cases

Text analysis

Similar sentence retrieval

Find semantically similar sentences within documents.

Topic classification

Classify topics based on sentence semantics.
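As a rough illustration of the similar-sentence retrieval use case, the sketch below ranks a small corpus of embeddings against a query embedding by cosine similarity. The helper names `cosine_similarity` and `top_k_similar` are hypothetical, and the 3-dimensional vectors are toy stand-ins for real model embeddings, which would come from the model itself.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_similar(query, corpus, k=3):
    """Return up to k (index, score) pairs for the corpus vectors closest to the query."""
    scored = [(cosine_similarity(query, emb), idx) for idx, emb in enumerate(corpus)]
    # Highest similarity first; the sort is stable, so ties keep corpus order.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(idx, score) for score, idx in scored[:k]]


# Toy vectors standing in for sentence embeddings (assumption, not model output).
corpus = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
query = [1.0, 0.0, 0.0]

for idx, score in top_k_similar(query, corpus, k=2):
    print(idx, round(score, 4))
```

For larger corpora, a vectorized or approximate nearest-neighbor search would normally replace this linear scan.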

🚀 SHerbert - Polish SentenceBERT

SentenceBERT is a modification of the pretrained BERT network. It uses siamese and triplet network structures to derive semantically meaningful sentence embeddings, which can be compared using cosine similarity. The model aims to generate different embeddings based on the semantic and topic similarity of the given text.

Semantic textual similarity measures how similar two pieces of text are.
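For reference, the cosine similarity used to compare two sentence embeddings u and v is the standard definition below. The symbols u, v, and θ (the angle between the vectors) are notation introduced here, not taken from the model card.

```latex
\[
\operatorname{sim}(u, v) = \cos\theta = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
\]
```

The value lies between -1 and 1, and a higher value indicates that the two embeddings point in more similar directions.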

Read more about how the model was prepared in our blog post.

🚀 Quick Start

Training was based on the original paper, Siamese BERT models for the task of semantic textual similarity (STS), with a minor adjustment in how the training data was used.

The base model is the Polish HerBERT, a BERT-based language model. For more details, see "HerBERT: Efficiently Pretrained Transformer-based Language Model for Polish".

✨ Features

Semantic Embeddings: Generates semantically meaningful sentence embeddings for text similarity analysis.
Based on HerBERT: Utilizes the Polish HerBERT as the base model.
Trained on Wikipedia: The model was trained solely on Wikipedia.

📦 Installation

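The original model card gives no installation steps. The usage example imports transformers and sklearn (scikit-learn) and loads a PyTorch model, so those packages need to be installed.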

💻 Usage Examples

Basic Usage

import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.metrics import pairwise

sbert = AutoModel.from_pretrained("Voicelab/sbert-base-cased-pl")
tokenizer = AutoTokenizer.from_pretrained("Voicelab/sbert-base-cased-pl")

s0 = "Uczenie maszynowe jest konsekwencją rozwoju idei sztucznej inteligencji i metod jej wdrażania praktycznego."
s1 = "Głębokie uczenie maszynowe jest sktukiem wdrażania praktycznego metod sztucznej inteligencji oraz jej rozwoju."
s2 = "Kasparow zarzucił firmie IBM oszustwo, kiedy odmówiła mu dostępu do historii wcześniejszych gier Deep Blue. "

# Tokenize the three sentences as one padded batch of PyTorch tensors
tokens = tokenizer([s0, s1, s2],
                   padding=True,
                   truncation=True,
                   return_tensors="pt")

# Inference only: disable gradient tracking so the output converts to NumPy
with torch.no_grad():
    x = sbert(input_ids=tokens["input_ids"],
              attention_mask=tokens["attention_mask"]).pooler_output.numpy()

# cosine_similarity expects 2D arrays, so slice rows instead of indexing them

# similarity between sentences s0 and s1
print(pairwise.cosine_similarity(x[0:1], x[1:2])) # Reported result: 0.7952354

# similarity between sentences s0 and s2
print(pairwise.cosine_similarity(x[0:1], x[2:3])) # Reported result: 0.42359722

📚 Documentation

Corpus

The model was trained solely on Wikipedia.

Tokenizer

As in the original HerBERT implementation, the training dataset was tokenized into subwords using character-level byte-pair encoding (CharBPETokenizer) with a vocabulary size of 50k tokens. The tokenizer itself was trained with the tokenizers library.

We kindly encourage you to use the Fast version of the tokenizer, namely HerbertTokenizerFast.
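To make the byte-pair encoding idea above more concrete, here is a minimal toy sketch of how character-level BPE learns merge rules and applies them to a new word. It is not the HerBERT tokenizer or the tokenizers library implementation; the function names `learn_bpe_merges` and `apply_merges` and the tiny word list are hypothetical and chosen only for illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with one joined symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy character-level BPE)."""
    # Each word starts as a sequence of single characters, counted by frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken lexicographically for determinism.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[tuple(merge_pair(list(symbols), best))] += freq
        vocab = new_vocab
    return merges


def apply_merges(word, merges):
    """Segment a word by applying the learned merges in the order they were learned."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


merges = learn_bpe_merges(["abab", "abab", "ab"], num_merges=2)
print(merges)                        # [('a', 'b'), ('ab', 'ab')]
print(apply_merges("abba", merges))  # ['ab', 'b', 'a']
```

A production tokenizer adds details this sketch leaves out, such as end-of-word markers, normalization, and a fixed target vocabulary size.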

Results

Model	Accuracy	Source
SBERT-WikiSec-base (EN)	80.42%	https://arxiv.org/abs/1908.10084
SBERT-WikiSec-large (EN)	80.78%	https://arxiv.org/abs/1908.10084
sbert-base-cased-pl	82.31%	https://huggingface.co/Voicelab/sbert-base-cased-pl
sbert-large-cased-pl	84.42%	https://huggingface.co/Voicelab/sbert-large-cased-pl

📄 License

CC BY 4.0

📚 Citation

If you use this model, please cite the following paper:

👥 Authors

The model was trained by the NLP Research Team at Voicelab.ai.

You can contact us here.
