🚀 NB-SBERT-BASE
NB-SBERT-BASE is a SentenceTransformers model that maps sentences and paragraphs to a 768-dimensional dense vector space, enabling tasks like clustering and semantic search.
🚀 Quick Start
NB-SBERT-BASE is a SentenceTransformers model. It was trained on a machine translated version of the MNLI dataset, starting from nb-bert-base.
This model maps sentences and paragraphs to a 768-dimensional dense vector space. The resulting vectors can be used for tasks such as clustering and semantic search. The easiest way to use the model is to measure the cosine distance between two sentences. Sentences with similar meanings will have a small cosine distance and a similarity close to 1. The model is trained so that similar sentences in different languages are also close to each other. Ideally, an English-Norwegian sentence pair should have high similarity.
✨ Features
- Maps sentences & paragraphs to a 768-dimensional dense vector space.
- Can be used for tasks like clustering and semantic search.
- Enables cross-language sentence similarity measurement.
📦 Installation
Installing the sentence-transformers Library
pip install -U sentence-transformers
Installing keybert for Keyword Extraction
pip install keybert
Installing autofaiss for Similarity Search
pip install autofaiss sentence-transformers
💻 Usage Examples
Basic Usage with sentence-transformers
from sentence_transformers import SentenceTransformer, util
sentences = ["This is a Norwegian boy", "Dette er en norsk gutt"]
model = SentenceTransformer('NbAiLab/nb-sbert-base')
embeddings = model.encode(sentences)
print(embeddings)
cosine_scores = util.cos_sim(embeddings[0], embeddings[1])
print(cosine_scores)
from scipy import spatial
scipy_cosine_scores = 1 - spatial.distance.cosine(embeddings[0], embeddings[1])
print(scipy_cosine_scores)
Keyword Extraction with KeyBERT
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer
sentence_model = SentenceTransformer("NbAiLab/nb-sbert-base")
kw_model = KeyBERT(model=sentence_model)
doc = """
De første nasjonale bibliotek har sin opprinnelse i kongelige samlinger eller en annen framstående myndighet eller statsoverhode.
Et av de første planene for et nasjonalbibliotek i England ble fremmet av den walisiske matematikeren og mystikeren John Dee som
i 1556 presenterte en visjonær plan om et nasjonalt bibliotek for gamle bøker, manuskripter og opptegnelser for dronning Maria I
av England. Hans forslag ble ikke tatt til følge.
"""
kw_model.extract_keywords(doc, stop_words=None)
Similarity Search with autofaiss
from autofaiss import build_index
import numpy as np
from sentence_transformers import SentenceTransformer, util
sentences = ["This is a Norwegian boy", "Dette er en norsk gutt", "A red house"]
model = SentenceTransformer('NbAiLab/nb-sbert-base')
embeddings = model.encode(sentences)
index, index_infos = build_index(embeddings, save_on_disk=False)
query = model.encode(["A young boy"])
_, index_matches = index.search(query, 1)
print(index_matches)
📚 Documentation
Embeddings and Sentence Similarity (Sentence-Transformers)
Using the sentence-transformers library makes it convenient to use these models. First, install the library as shown above, and then you can use the model as demonstrated in the basic usage example.
SetFit - Few Shot Classification
SetFit is a method for addressing the problem of having too few labeled training examples in NLP. nb-sbert-base can be used directly with the SetFit library. Refer to this tutorial for usage details.
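As a rough illustration (not taken from the tutorial), a few-shot classifier could be set up along these lines, assuming the setfit library's SetFitTrainer API (renamed Trainer in newer releases) and toy placeholder data:
from datasets import Dataset
from setfit import SetFitModel, SetFitTrainer
# Toy placeholder examples; a real few-shot setup would use a handful of labeled sentences per class.
train_dataset = Dataset.from_dict({
    "text": ["Denne filmen var fantastisk", "Denne filmen var forferdelig"],
    "label": [1, 0],
})
# Use nb-sbert-base as the sentence-embedding backbone for SetFit.
model = SetFitModel.from_pretrained("NbAiLab/nb-sbert-base")
trainer = SetFitTrainer(model=model, train_dataset=train_dataset)
trainer.train()
print(model.predict(["En helt grei film"]))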
Topic Modeling
BERTopic combines sentence transformers with c-TF-IDF to create topic clusters. To use the Norwegian nb-sbert-base as the embedding model, pass it to BERTopic:
from bertopic import BERTopic
# docs is a list of (Norwegian) documents to model
topic_model = BERTopic(embedding_model='NbAiLab/nb-sbert-base').fit(docs)
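After fitting, the discovered topics can be inspected with BERTopic's get_topic_info method:
topic_model.get_topic_info()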
🔧 Technical Details
Evaluation
| Metric | Pearson | Spearman |
|---|---|---|
| Cosine Similarity | 0.8275 | 0.8245 |
| Manhattan Distance | 0.8193 | 0.8182 |
| Euclidean Distance | 0.8190 | 0.8180 |
| Dot Product Similarity | 0.8039 | 0.7951 |
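These are Pearson and Spearman correlations of the kind reported by sentence-transformers' EmbeddingSimilarityEvaluator (the evaluator listed under Training below). A comparable evaluation can be run roughly as follows; the sentence pairs and gold scores here are hypothetical placeholders, not the actual evaluation data:
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
model = SentenceTransformer("NbAiLab/nb-sbert-base")
# Placeholder STS-style pairs with gold similarity scores in [0, 1].
sentences1 = ["En mann spiller gitar", "En katt sover i sola"]
sentences2 = ["En person spiller et instrument", "En hund løper i parken"]
gold_scores = [0.8, 0.1]
evaluator = EmbeddingSimilarityEvaluator(sentences1, sentences2, gold_scores, name="sts-dev")
# Prints the evaluation result (a correlation score, or a dict of scores in newer library versions).
print(evaluator(model))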
Training
DataLoader:
sentence_transformers.datasets.NoDuplicatesDataLoader.NoDuplicatesDataLoader of length 16471 with parameters:
{'batch_size': 32}
Loss:
sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss with parameters:
{'scale': 20.0, 'similarity_fct': 'cos_sim'}
Parameters of the fit() method:
{
    "epochs": 1,
    "evaluation_steps": 1647,
    "evaluator": "sentence_transformers.evaluation.EmbeddingSimilarityEvaluator.EmbeddingSimilarityEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 1648,
    "weight_decay": 0.01
}
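The settings above correspond to sentence-transformers' legacy fit() training API. A hedged reconstruction of the setup might look like this, using hypothetical placeholder pairs in place of the machine-translated MNLI data and a reduced batch size so the placeholders can fill a batch (the original run used batch_size=32):
from sentence_transformers import SentenceTransformer, InputExample, losses, models, util
from sentence_transformers.datasets import NoDuplicatesDataLoader
# Build the architecture shown below: nb-bert-base with mean pooling.
word_embedding_model = models.Transformer("NbAiLab/nb-bert-base", max_seq_length=75)
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=True,
)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
# Hypothetical placeholder pairs; the actual run used entailment pairs from a machine-translated MNLI dataset.
train_examples = [
    InputExample(texts=["En mann spiller gitar", "En person spiller et instrument"]),
    InputExample(texts=["Dette er en norsk gutt", "This is a Norwegian boy"]),
]
train_dataloader = NoDuplicatesDataLoader(train_examples, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model, scale=20.0, similarity_fct=util.cos_sim)
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    scheduler="WarmupLinear",
    warmup_steps=1648,
    optimizer_params={"lr": 2e-05},
    weight_decay=0.01,
    max_grad_norm=1,
)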
Full Model Architecture
SentenceTransformer(
  (0): Transformer({'max_seq_length': 75, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
📄 License
This model is licensed under the Apache 2.0 license.
Citing & Authors
The model was trained by Rolv-Arild Braaten and Per Egil Kummervold. Documentation was written by Javier de la Rosa, Rolv-Arild Braaten, and Per Egil Kummervold.