robbert-2022-dutch Open-Source Dutch Sentence Transformer - Enabling Semantic Search and Clustering Functions

Robbert 2022 Dutch Sentence Transformers Onnx

Developed by Todai

ONNX version of the Dutch sentence transformer based on the RobBERT model. It maps text to a 768-dimensional vector space and is suitable for semantic search and clustering tasks.

Text Embedding

Transformers

Other #Dutch sentence embeddings #Semantic similarity calculation #Multi-scenario adaptation

Downloads 30

Release Time : 12/13/2023

Model Overview

This model is the ONNX-converted version of the original robbert-2022-dutch-sentence-transformers model. It converts Dutch sentences and paragraphs into 768-dimensional dense vector representations and supports tasks such as semantic similarity calculation and text clustering.

Model Features

Dutch Language Optimization

Fine-tuned on paraphrase data machine-translated to Dutch, targeting Dutch semantic similarity tasks

ONNX Format

Converted to ONNX format for easy deployment across different platforms and environments

Semantic Vector Representation

Converts input text into 768-dimensional dense vectors in which semantically similar texts lie close together

Model Capabilities

Sentence similarity calculation

Semantic search

Text clustering

Feature extraction
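The first two capabilities both reduce to comparing embedding vectors. The following is a minimal sketch of cosine similarity and a top-k search over it, written in plain Python so it runs without the model. The function names and the tiny 3-dimensional vectors are illustrative stand-ins for real 768-dimensional embeddings, not part of the model's API.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

def semantic_search(query_vec, corpus_vecs, top_k=3):
    """Return (index, score) pairs for the top_k most similar corpus vectors."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(corpus_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy stand-ins for embeddings produced by the model.
corpus = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
query = [0.8, 0.6, 0.0]
hits = semantic_search(query, corpus, top_k=2)
print(hits)
```

In practice the vectors would come from encoding the query and the corpus with the model; the ranking logic stays the same.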

Use Cases

Information Retrieval

Duplicate Question Detection

Identifying duplicate questions in forums or Q&A platforms

Flags questions that are semantically similar but phrased differently

Content Management

Document Clustering

Automatically grouping and organizing large volumes of documents

Groups documents by the semantic similarity of their embeddings
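One simple way to group documents by embedding similarity is a greedy threshold pass, sketched below in plain Python. This is an illustrative technique chosen for brevity, not the method used by the model authors; the function name `threshold_clusters`, the threshold of 0.75, and the 2-dimensional toy vectors are all assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def threshold_clusters(vectors, threshold=0.75):
    """Greedy single pass: a vector joins the first cluster whose seed
    (its first member) is at least `threshold` similar, else starts a new one."""
    clusters = []  # each cluster is a list of indices into `vectors`
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine_similarity(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters

# Toy stand-ins for document embeddings.
docs = [[1.0, 0.0], [0.9, 0.2], [0.0, 1.0], [0.1, 0.95]]
clusters = threshold_clusters(docs)
print(clusters)
```

The result depends on input order and on the threshold, so for larger collections a standard algorithm such as k-means or agglomerative clustering is usually a better fit.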

🚀 robbert-2022-dutch-sentence-transformers - Onnx

This ONNX model is a converted version of robbert-2022-dutch-sentence-transformers. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks such as clustering or semantic search.

🚀 Quick Start

Prerequisites

Install sentence-transformers to use the original model from which this ONNX version was converted:

pip install -U sentence-transformers

Usage Example

from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('NetherlandsForensicInstitute/robbert-2022-dutch-sentence-transformers')
embeddings = model.encode(sentences)
print(embeddings)

✨ Features

Based on KU Leuven's RobBERT model.
Fine-tuned on paraphrase datasets that were machine-translated to Dutch.
Can map sentences and paragraphs to a 768-dimensional dense vector space for clustering or semantic search.

💻 Usage Examples

Advanced Usage

Without sentence-transformers, pass the input through the transformer model and then apply mean pooling over the token embeddings yourself:

from transformers import AutoTokenizer, AutoModel
import torch


# Mean pooling: use the attention mask so padding tokens do not skew the average
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('NetherlandsForensicInstitute/robbert-2022-dutch-sentence-transformers')
model = AutoModel.from_pretrained('NetherlandsForensicInstitute/robbert-2022-dutch-sentence-transformers')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
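The same pooling step can be applied to the output of an ONNX session. The sketch below is a rough outline under several assumptions: `embed_with_onnx`, `onnx_path`, and `tokenizer_name` are hypothetical names, the location of the exported file is not given in this document, and the layout of the graph's inputs and outputs is not documented here, so the code reads the input names from the session and only pools when the first output is three-dimensional.

```python
import numpy as np

def masked_mean(token_embeddings, attention_mask):
    """Average token vectors over real (non-padding) tokens only."""
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts

def embed_with_onnx(sentences, onnx_path, tokenizer_name):
    """Hypothetical helper: run an exported ONNX file and return sentence vectors."""
    # Imported lazily so masked_mean can be used without these packages.
    import onnxruntime as ort
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    session = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])

    # max_length=128 mirrors the max_seq_length listed in the model architecture.
    encoded = tokenizer(sentences, padding=True, truncation=True,
                        max_length=128, return_tensors="np")

    # Feed only the inputs that the exported graph actually declares.
    wanted = {node.name for node in session.get_inputs()}
    feed = {name: array for name, array in encoded.items() if name in wanted}

    first_output = session.run(None, feed)[0]
    if first_output.ndim == 3:  # (batch, tokens, hidden): pool it ourselves
        return masked_mean(first_output, encoded["attention_mask"])
    return first_output  # assumed to be pooled already
```

Calling `embed_with_onnx(["Dit is een voorbeeldzin"], "model.onnx", "NetherlandsForensicInstitute/robbert-2022-dutch-sentence-transformers")` would then return one vector per sentence, where `"model.onnx"` is a placeholder path.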

🔧 Technical Details

Training Parameters

The model was trained with the following parameters:

DataLoader: MultiDatasetDataLoader.MultiDatasetDataLoader of length 414262 with parameters:

{'batch_size': 1}

Loss: sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss with parameters:

{'scale': 20.0, 'similarity_fct': 'cos_sim'}
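To make the loss parameters concrete, here is a plain-Python sketch of what a multiple-negatives ranking loss computes: for each anchor, the paired sentence is the positive, every other sentence in the batch serves as a negative, and the cosine similarities are multiplied by `scale` before a softmax cross-entropy. It is a simplified illustration, not the library's implementation, and the 2-dimensional vectors are toy inputs.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where the correct 'class' for anchor i is positive i."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine_similarity(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over all candidates in the batch.
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = multiple_negatives_ranking_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])
swapped = multiple_negatives_ranking_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])
print(aligned, swapped)
```

The loss is near zero when each anchor is closest to its own positive and grows to roughly `scale` when the pairs are swapped, which is why a larger scale sharpens the ranking signal.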

Parameters of the fit()-Method:

{
    "epochs": 1,
    "evaluation_steps": 50000,
    "evaluator": "sentence_transformers.evaluation.EmbeddingSimilarityEvaluator.EmbeddingSimilarityEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 500,
    "weight_decay": 0.01
}
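The `WarmupLinear` scheduler with `warmup_steps` of 500 can be pictured as a learning rate that rises linearly from zero to `lr` and then decays linearly back to zero. The sketch below reproduces that shape in plain Python. The total of 414262 steps is an assumption derived from the DataLoader length and the single epoch listed above, and the function name is illustrative.

```python
def warmup_linear_lr(step, base_lr=2e-05, warmup_steps=500, total_steps=414262):
    """Learning rate at a given optimizer step: linear warmup, then linear decay."""
    if step < warmup_steps:
        factor = step / max(1, warmup_steps)
    else:
        factor = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return base_lr * factor

# Start of training, mid-warmup, peak, and final step.
for step in (0, 250, 500, 414262):
    print(step, warmup_linear_lr(step))
```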

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: RobertaModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
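The printed architecture contains a Transformer module and a Pooling module but no normalization step, so the resulting vectors are not guaranteed to have unit length. If a downstream index ranks by dot product, normalizing first makes the dot product equal to cosine similarity. A minimal plain-Python sketch, with illustrative helper names and toy vectors:

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction as a, twice the length

raw = dot(a, b)                                   # grows with vector length
normalized = dot(l2_normalize(a), l2_normalize(b))  # cosine similarity
print(raw, normalized)
```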

Model Information

Property	Details
Model Type	Onnx converted from robbert-2022-dutch-sentence-transformers
Training Data	NetherlandsForensicInstitute/AllNLI-translated-nl, NetherlandsForensicInstitute/altlex-translated-nl, NetherlandsForensicInstitute/coco-captions-translated-nl, NetherlandsForensicInstitute/flickr30k-captions-translated-nl, NetherlandsForensicInstitute/msmarco-translated-nl, NetherlandsForensicInstitute/quora-duplicates-translated-nl, NetherlandsForensicInstitute/sentence-compression-translated-nl, NetherlandsForensicInstitute/simplewiki-translated-nl, NetherlandsForensicInstitute/stackexchange-duplicate-questions-translated-nl, NetherlandsForensicInstitute/wiki-atomic-edits-translated-nl

Model Creators

Model creator: Netherlands Forensic Institute
Original model: robbert-2022-dutch-sentence-transformers
