🚀 NB-SBERT-BASE
NB-SBERT-BASE is a SentenceTransformers model that maps sentences and paragraphs to a 768-dimensional dense vector space, enabling tasks like clustering and semantic search.
🚀 Quick Start
NB-SBERT-BASE is a SentenceTransformers model. It was trained on a machine translated version of the MNLI dataset, starting from nb-bert-base.
This model maps sentences and paragraphs to a 768-dimensional dense vector space. The resulting vectors can be used for tasks such as clustering and semantic search. The easiest way to use the model is to measure the cosine distance between two sentences. Sentences with similar meanings will have a small cosine distance and a similarity close to 1. The model is trained so that similar sentences in different languages are also close to each other. Ideally, an English-Norwegian sentence pair should have high similarity.
✨ Features
- Maps sentences & paragraphs to a 768-dimensional dense vector space.
- Can be used for tasks like clustering and semantic search.
- Enables cross-language sentence similarity measurement.
📦 Installation
Installing the sentence-transformers Library
pip install -U sentence-transformers
Installing keybert for Keyword Extraction
pip install keybert
Installing autofaiss for Similarity Search
pip install autofaiss sentence-transformers
💻 Usage Examples
Basic Usage with sentence-transformers
from sentence_transformers import SentenceTransformer, util
sentences = ["This is a Norwegian boy", "Dette er en norsk gutt"]
model = SentenceTransformer('NbAiLab/nb-sbert-base')
embeddings = model.encode(sentences)
print(embeddings)
cosine_scores = util.cos_sim(embeddings[0], embeddings[1])
print(cosine_scores)
from scipy import spatial
scipy_cosine_scores = 1 - spatial.distance.cosine(embeddings[0], embeddings[1])
print(scipy_cosine_scores)
Keyword Extraction with KeyBERT
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer
sentence_model = SentenceTransformer("NbAiLab/nb-sbert-base")
kw_model = KeyBERT(model=sentence_model)
doc = """
De første nasjonale bibliotek har sin opprinnelse i kongelige samlinger eller en annen framstående myndighet eller statsoverhode.
Et av de første planene for et nasjonalbibliotek i England ble fremmet av den walisiske matematikeren og mystikeren John Dee som
i 1556 presenterte en visjonær plan om et nasjonalt bibliotek for gamle bøker, manuskripter og opptegnelser for dronning Maria I
av England. Hans forslag ble ikke tatt til følge.
"""
kw_model.extract_keywords(doc, stop_words=None)
Similarity Search with autofaiss
from autofaiss import build_index
import numpy as np
from sentence_transformers import SentenceTransformer, util
sentences = ["This is a Norwegian boy", "Dette er en norsk gutt", "A red house"]
model = SentenceTransformer('NbAiLab/nb-sbert-base')
embeddings = model.encode(sentences)
index, index_infos = build_index(embeddings, save_on_disk=False)
query = model.encode(["A young boy"])
_, index_matches = index.search(query, 1)
print(index_matches)
📚 Documentation
Embeddings and Sentence Similarity (Sentence-Transformers)
Using the sentence-transformers library makes it convenient to use these models. First, install the library as shown above, and then you can use the model as demonstrated in the basic usage example.
SetFit - Few Shot Classification
SetFit is a method for addressing the problem of having too few labeled training examples in NLP. nb-sbert-base can be used directly with the SetFit library. Refer to this tutorial for usage details.
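As a rough illustration (not taken from the tutorial), a few-shot classifier could be set up along these lines, assuming the setfit library's SetFitTrainer API (renamed Trainer in newer releases) and toy placeholder data:
from datasets import Dataset
from setfit import SetFitModel, SetFitTrainer
# Toy placeholder examples; a real few-shot setup would use a handful of labeled sentences per class.
train_dataset = Dataset.from_dict({
    "text": ["Denne filmen var fantastisk", "Denne filmen var forferdelig"],
    "label": [1, 0],
})
# Use nb-sbert-base as the sentence-embedding backbone for SetFit.
model = SetFitModel.from_pretrained("NbAiLab/nb-sbert-base")
trainer = SetFitTrainer(model=model, train_dataset=train_dataset)
trainer.train()
print(model.predict(["En helt grei film"]))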
Topic Modeling
BERTopic combines sentence transformers with c-TF-IDF to create topic clusters. To use the Norwegian nb-sbert-base as the embedding model, pass it to BERTopic:
from bertopic import BERTopic
# docs is a list of (Norwegian) documents to model
topic_model = BERTopic(embedding_model='NbAiLab/nb-sbert-base').fit(docs)
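After fitting, the discovered topics can be inspected with BERTopic's get_topic_info method:
topic_model.get_topic_info()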
🔧 Technical Details
Evaluation
| Metric | Pearson | Spearman |
|---|---|---|
| Cosine Similarity | 0.8275 | 0.8245 |
| Manhattan Distance | 0.8193 | 0.8182 |
| Euclidean Distance | 0.8190 | 0.8180 |
| Dot Product Similarity | 0.8039 | 0.7951 |
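These are Pearson and Spearman correlations of the kind reported by sentence-transformers' EmbeddingSimilarityEvaluator (the evaluator listed under Training below). A comparable evaluation can be run roughly as follows; the sentence pairs and gold scores here are hypothetical placeholders, not the actual evaluation data:
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
model = SentenceTransformer("NbAiLab/nb-sbert-base")
# Placeholder STS-style pairs with gold similarity scores in [0, 1].
sentences1 = ["En mann spiller gitar", "En katt sover i sola"]
sentences2 = ["En person spiller et instrument", "En hund løper i parken"]
gold_scores = [0.8, 0.1]
evaluator = EmbeddingSimilarityEvaluator(sentences1, sentences2, gold_scores, name="sts-dev")
# Prints the evaluation result (a correlation score, or a dict of scores in newer library versions).
print(evaluator(model))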
Training
DataLoader:
sentence_transformers.datasets.NoDuplicatesDataLoader.NoDuplicatesDataLoader of length 16471 with parameters:
{'batch_size': 32}
Loss:
sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss with parameters:
{'scale': 20.0, 'similarity_fct': 'cos_sim'}
Parameters of the fit() method:
{
    "epochs": 1,
    "evaluation_steps": 1647,
    "evaluator": "sentence_transformers.evaluation.EmbeddingSimilarityEvaluator.EmbeddingSimilarityEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 1648,
    "weight_decay": 0.01
}
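The settings above correspond to sentence-transformers' legacy fit() training API. A hedged reconstruction of the setup might look like this, using hypothetical placeholder pairs in place of the machine-translated MNLI data and a reduced batch size so the placeholders can fill a batch (the original run used batch_size=32):
from sentence_transformers import SentenceTransformer, InputExample, losses, models, util
from sentence_transformers.datasets import NoDuplicatesDataLoader
# Build the architecture shown below: nb-bert-base with mean pooling.
word_embedding_model = models.Transformer("NbAiLab/nb-bert-base", max_seq_length=75)
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=True,
)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
# Hypothetical placeholder pairs; the actual run used entailment pairs from a machine-translated MNLI dataset.
train_examples = [
    InputExample(texts=["En mann spiller gitar", "En person spiller et instrument"]),
    InputExample(texts=["Dette er en norsk gutt", "This is a Norwegian boy"]),
]
train_dataloader = NoDuplicatesDataLoader(train_examples, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model, scale=20.0, similarity_fct=util.cos_sim)
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    scheduler="WarmupLinear",
    warmup_steps=1648,
    optimizer_params={"lr": 2e-05},
    weight_decay=0.01,
    max_grad_norm=1,
)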
Full Model Architecture
SentenceTransformer(
  (0): Transformer({'max_seq_length': 75, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
📄 License
This model is licensed under the Apache 2.0 license.
Citing & Authors
The model was trained by Rolv-Arild Braaten and Per Egil Kummervold. Documentation was written by Javier de la Rosa, Rolv-Arild Braaten, and Per Egil Kummervold.