🚀 SHerbert large - Polish SentenceBERT
SentenceBERT is a modification of the pretrained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings, which can be compared via cosine similarity. The goal of this model is to produce embeddings that reflect the semantic and topical similarity of the input text. Training followed the original Siamese BERT setup for the semantic textual similarity (STS) task, with a minor adjustment in how the training data was used.
Semantic textual similarity measures how similar two pieces of text are in meaning.
For more information on how the model was prepared, check our blog post. The base model is Polish HerBERT, a BERT-based language model. For more details, refer to: "HerBERT: Efficiently Pretrained Transformer-based Language Model for Polish".
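For reference, the cosine similarity used to compare two sentence embeddings $u$ and $v$ is the standard measure:

$$\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$$

Scores close to 1 indicate sentences with very similar meaning, while lower scores indicate less related texts.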
🚀 Quick Start
The model is ready to use for sentence similarity tasks. You can follow the usage examples below to get started.
✨ Features
- Semantic Embeddings: Generates semantically meaningful sentence embeddings.
- Cosine-Similarity Comparison: Allows easy comparison of sentences via cosine similarity.
- Based on HerBERT: Utilizes the Polish HerBERT as the base model.
📦 Installation
No specific installation steps are provided in the original card; the usage example below only needs a few standard packages.
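As a minimal setup, assuming a standard pip environment (package names are inferred from the imports in the usage example; exact versions are not specified):

```bash
pip install transformers torch scikit-learn
```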
💻 Usage Examples
Basic Usage
```python
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.metrics import pairwise

sbert = AutoModel.from_pretrained("Voicelab/sbert-large-cased-pl")
tokenizer = AutoTokenizer.from_pretrained("Voicelab/sbert-large-cased-pl")

# Example sentences (Polish): s0 and s1 paraphrase each other; s2 is unrelated
# (it is about Kasparov's dispute with IBM over Deep Blue's game history).
s0 = "Uczenie maszynowe jest konsekwencją rozwoju idei sztucznej inteligencji i metod jej wdrażania praktycznego."
s1 = "Głębokie uczenie maszynowe jest skutkiem wdrażania praktycznego metod sztucznej inteligencji oraz jej rozwoju."
s2 = "Kasparow zarzucił firmie IBM oszustwo, kiedy odmówiła mu dostępu do historii wcześniejszych gier Deep Blue."

tokens = tokenizer([s0, s1, s2],
                   padding=True,
                   truncation=True,
                   return_tensors='pt')

# The pooler output serves as the sentence embedding; gradients are not needed for inference.
with torch.no_grad():
    x = sbert(tokens["input_ids"],
              tokens["attention_mask"]).pooler_output

# cosine_similarity expects 2D inputs, so keep the batch dimension when slicing.
print(pairwise.cosine_similarity(x[0:1], x[1:2]))  # similar pair -> higher score
print(pairwise.cosine_similarity(x[0:1], x[2:3]))  # unrelated pair -> lower score
```
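As a small follow-up (not part of the original card), the same embeddings can be compared all at once, continuing from the snippet above, which yields the full pairwise similarity matrix:

```python
# Continuing from the snippet above: `x` holds the three sentence embeddings.
sim_matrix = pairwise.cosine_similarity(x)
print(sim_matrix)  # sim_matrix[i, j] is the cosine similarity between sentences i and j
```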
📚 Documentation
Corpus
The model was trained solely on Wikipedia.
Tokenizer
As in the original HerBERT implementation, the training dataset was tokenized into subwords using a character-level byte-pair encoding (CharBPETokenizer) with a vocabulary size of 50k tokens. The tokenizer itself was trained with the tokenizers library. We recommend using the Fast version of the tokenizer, namely HerbertTokenizerFast.
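A minimal sketch of loading the recommended fast tokenizer, assuming the model repository ships the standard HerBERT tokenizer files:

```python
from transformers import HerbertTokenizerFast

# Load the fast (Rust-backed) tokenizer explicitly; AutoTokenizer would normally
# return a fast variant by default when the tokenizer files support it.
fast_tokenizer = HerbertTokenizerFast.from_pretrained("Voicelab/sbert-large-cased-pl")

# Inspect the CharBPE subword segmentation of a Polish sentence.
print(fast_tokenizer.tokenize("Uczenie maszynowe jest fascynujące."))
```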
Results
| Property | Details |
|----------|---------|
| Model Type | SHerbert large - Polish SentenceBERT |
| Training Data | Wikipedia |
| Model | Accuracy | Source |
|-------|----------|--------|
| SBERT-WikiSec-base (EN) | 80.42% | https://arxiv.org/abs/1908.10084 |
| SBERT-WikiSec-large (EN) | 80.78% | https://arxiv.org/abs/1908.10084 |
| sbert-base-cased-pl | 82.31% | https://huggingface.co/Voicelab/sbert-base-cased-pl |
| sbert-large-cased-pl | 84.42% | https://huggingface.co/Voicelab/sbert-large-cased-pl |
📄 License
CC BY 4.0
📖 Citation
If you use this model, please cite the following paper:
👥 Authors
The model was trained by the NLP Research Team at Voicelab.ai. You can contact us here.
