gbert-large-paraphrase-euclidean Open-source German Sentence Embedding Model - Free and Efficient Classification with Small Samples

Gbert Large Paraphrase Euclidean

Developed by deutsche-telekom

German sentence embedding model based on sentence-transformers, mapping text to a 1024-dimensional vector space, optimized for few-shot classification

Text Embedding

Transformers

Language: German
Open Source License: MIT
Tags: German sentence similarity, Euclidean distance optimization, Few-shot learning

Downloads 19.03k

Release Time : 1/13/2023

Model Overview

This model is a German sentence embedding model built on deepset/gbert-large, using Euclidean distance as the similarity metric, specifically designed to enhance German few-shot classification performance when combined with SetFit.

Model Features

Euclidean distance optimization

Trained using BatchHardSoftMarginTripletLoss with Euclidean distance, so embeddings are meant to be compared by Euclidean distance rather than cosine similarity

High-quality training data

Based on rigorously filtered German back-translation and paraphrase datasets to ensure training quality

Few-shot optimization

Specifically designed to improve text classification performance in German few-shot scenarios

Sibling model

Provides a cosine similarity version as a complementary option (deutsche-telekom/gbert-large-paraphrase-cosine)

Model Capabilities

German text embedding

Sentence similarity calculation

Few-shot learning

Text classification support

Use Cases

Text classification

Few-shot classification tasks

German text classification with limited labeled data

Best qualitative results among the compared models on the NLU Few-shot Benchmark

Semantic search

German document retrieval

German document search system based on semantic similarity

🚀 German BERT large paraphrase euclidean

This is a sentence-transformers model that maps sentences and paragraphs (text) into a 1024-dimensional dense vector space. It's designed to be used with SetFit to enhance German few-shot text classification. There's also a sibling model named deutsche-telekom/gbert-large-paraphrase-cosine. This model is based on deepset/gbert-large, and we're very grateful to deepset!

🚀 Quick Start

This model is a sentence-transformers model. It maps sentences and paragraphs (text) into a 1024-dimensional dense vector space. The model is intended to be used together with SetFit to improve German few-shot text classification.
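Since the original card ships no usage code, the following is only a minimal sketch of how the embeddings might be produced and compared. It assumes the `sentence-transformers` package is installed and that the model is available under the Hub name used in this document; the helper names `embed` and `nearest` are hypothetical and exist only for illustration. Distances are computed as Euclidean distances, in line with the metric the model was trained with.

```python
import math
from typing import List, Sequence

# Hub name as given in this document.
MODEL_NAME = "deutsche-telekom/gbert-large-paraphrase-euclidean"


def embed(sentences: List[str]) -> List[List[float]]:
    """Encode sentences into dense vectors (downloads the model on first use)."""
    # Imported lazily because it is a heavy, optional dependency.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(MODEL_NAME)
    return model.encode(sentences).tolist()


def nearest(query: Sequence[float], candidates: Sequence[Sequence[float]]) -> int:
    """Return the index of the candidate with the smallest Euclidean distance."""
    distances = [math.dist(query, candidate) for candidate in candidates]
    return distances.index(min(distances))
```

A call could look like `vectors = embed(["Wie wird das Wetter morgen?", "Regnet es morgen?", "Ich mag Pizza."])` followed by `nearest(vectors[0], vectors[1:])`, which returns the position of the sentence closest to the first one.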

✨ Features

Maps sentences and paragraphs into a 1024-dimensional dense vector space.
Intended for use with SetFit to enhance German few-shot text classification.
Has a sibling model deutsche-telekom/gbert-large-paraphrase-cosine.
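The SetFit pairing mentioned above could be wired up roughly as follows. This is a hedged sketch, not the authors' training setup: it assumes the `setfit` (version 1.0 or later, which provides `Trainer` and `TrainingArguments`) and `datasets` packages, relies on SetFit's default `text` and `label` column names, and uses hypothetical helper names (`sample_few_shot`, `train_classifier`). The number of epochs is an arbitrary placeholder.

```python
from collections import defaultdict
from typing import Dict, List

MODEL_NAME = "deutsche-telekom/gbert-large-paraphrase-euclidean"


def sample_few_shot(texts: List[str], labels: List[int], k: int) -> Dict[str, list]:
    """Keep at most k examples per label, preserving input order."""
    seen = defaultdict(int)
    few_shot: Dict[str, list] = {"text": [], "label": []}
    for text, label in zip(texts, labels):
        if seen[label] < k:
            seen[label] += 1
            few_shot["text"].append(text)
            few_shot["label"].append(label)
    return few_shot


def train_classifier(few_shot: Dict[str, list]):
    """Fine-tune a SetFit classifier on the few-shot examples (assumes setfit >= 1.0)."""
    # Imported lazily because these are heavy, optional dependencies.
    from datasets import Dataset
    from setfit import SetFitModel, Trainer, TrainingArguments

    model = SetFitModel.from_pretrained(MODEL_NAME)
    trainer = Trainer(
        model=model,
        args=TrainingArguments(num_epochs=1),  # placeholder value, not from the source
        train_dataset=Dataset.from_dict(few_shot),
    )
    trainer.train()
    return model
```

After training, `model.predict(["Ein neuer deutscher Satz"])` would return the predicted label for each input text.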



📚 Documentation

Training

Loss Function

We used BatchHardSoftMarginTripletLoss with Euclidean distance as the loss function:

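    from sentence_transformers import losses
    from sentence_transformers.losses import BatchHardTripletLossDistanceFunction

    # `model` is the SentenceTransformer being fine-tuned; "eucledian" is the library's own spelling.
    train_loss = losses.BatchHardSoftMarginTripletLoss(
        model=model,
        distance_metric=BatchHardTripletLossDistanceFunction.eucledian_distance,
    )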
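To make the loss above more concrete, here is a small pure-Python sketch of the batch-hard soft-margin idea: for every anchor in a batch, take the farthest example with the same label (hardest positive) and the closest example with a different label (hardest negative), then apply a softplus to the difference of the two Euclidean distances. This is an illustration written for this document, not the sentence-transformers implementation; the function name is hypothetical, and skipping anchors that lack a positive or a negative is a simplifying assumption.

```python
import math
from typing import List, Sequence


def batch_hard_soft_margin_triplet_loss(
    embeddings: Sequence[Sequence[float]], labels: Sequence[int]
) -> float:
    """Mean of log(1 + exp(d_hardest_pos - d_hardest_neg)) over usable anchors."""
    per_anchor: List[float] = []
    for i, anchor in enumerate(embeddings):
        # Distances to all other examples sharing the anchor's label.
        positives = [
            math.dist(anchor, other)
            for j, other in enumerate(embeddings)
            if j != i and labels[j] == labels[i]
        ]
        # Distances to all examples with a different label.
        negatives = [
            math.dist(anchor, other)
            for j, other in enumerate(embeddings)
            if labels[j] != labels[i]
        ]
        if not positives or not negatives:
            continue  # simplifying assumption: skip anchors without a full triplet
        hardest_positive = max(positives)
        hardest_negative = min(negatives)
        per_anchor.append(math.log1p(math.exp(hardest_positive - hardest_negative)))
    return sum(per_anchor) / len(per_anchor) if per_anchor else 0.0
```

The loss is small when same-label examples sit closer together than different-label examples, and grows when the hardest positive is farther away than the hardest negative.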

Training Data

The model is trained on a carefully filtered version of the deutsche-telekom/ger-backtrans-paraphrase dataset. We removed sentence pairs that meet any of the following criteria:

min_char_len less than 15
jaccard_similarity greater than 0.3
de_token_count greater than 30
en_de_token_count greater than 30
cos_sim less than 0.85
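The filter criteria above can be expressed as a single predicate. The sketch below assumes each sentence pair is available as a dictionary whose keys match the field names in the list; the function name `keep_pair` is hypothetical, and how the rows are loaded is left open.

```python
from typing import Dict, Union

Number = Union[int, float]


def keep_pair(row: Dict[str, Number]) -> bool:
    """Return True if the sentence pair survives all five removal criteria."""
    remove = (
        row["min_char_len"] < 15            # too short
        or row["jaccard_similarity"] > 0.3  # too much lexical overlap
        or row["de_token_count"] > 30       # German side too long
        or row["en_de_token_count"] > 30    # too long by the en_de token count
        or row["cos_sim"] < 0.85            # not similar enough semantically
    )
    return not remove
```

Applied to a list of rows, `[row for row in rows if keep_pair(row)]` would yield the retained training pairs.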

Hyperparameters

Property	Details
learning_rate	5.5512022294147105e-06
num_epochs	7
train_batch_size	68
num_gpu	???

Evaluation Results

We use the NLU Few-shot Benchmark - English and German dataset to evaluate this model in a German few-shot scenario.

Qualitative results

Multilingual sentence embeddings provide the worst results.
Electra models also deliver poor results.
German BERT base size model (deepset/gbert-base) provides good results.
German BERT large size model (deepset/gbert-large) provides very good results.
Our fine-tuned models (this model and deutsche-telekom/gbert-large-paraphrase-cosine) provide the best results.

🔧 Technical Details

The model is based on deepset/gbert-large. It was trained with BatchHardSoftMarginTripletLoss, using Euclidean distance as the distance metric, on a filtered version of the deutsche-telekom/ger-backtrans-paraphrase dataset.

📄 License

Licensed under the MIT License (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License by reviewing the file LICENSE in the repository.
