Gbert Large Paraphrase Cosine

Developed by deutsche-telekom
A German text embedding model based on the sentence-transformers framework that maps text into a 1024-dimensional vector space, designed specifically to improve few-shot text classification performance in German.
Downloads 21.03k
Release Time: 1/13/2023

Model Overview

This model is built on deepset/gbert-large and uses cosine similarity as its similarity metric. It is suited to German sentence-similarity calculation and few-shot classification tasks.
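
As a quick illustration, the model can be loaded with the sentence-transformers library and queried for pairwise cosine similarity. This is a minimal sketch: the Hugging Face model id deutsche-telekom/gbert-large-paraphrase-cosine and the example sentences are assumptions, not taken from this page.

```python
from sentence_transformers import SentenceTransformer, util

# Assumed Hugging Face model id for this card.
model = SentenceTransformer("deutsche-telekom/gbert-large-paraphrase-cosine")

sentences = [
    "Das Wetter ist heute schön.",  # "The weather is nice today."
    "Heute scheint die Sonne.",     # "The sun is shining today."
    "Ich trinke gern Kaffee.",      # "I like drinking coffee."
]

# Each sentence is mapped to a 1024-dimensional vector.
embeddings = model.encode(sentences, convert_to_tensor=True)

# Pairwise cosine similarity between all sentences.
scores = util.cos_sim(embeddings, embeddings)
print(scores)
```

The first two sentences should score noticeably higher against each other than against the third, which is the behavior the cosine-similarity training objective targets.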

Model Features

High-quality German embeddings
Trained on a rigorously filtered German back-translated paraphrase dataset to ensure high-quality semantic representations
Few-shot optimization
Designed specifically for German few-shot learning scenarios, compatible with the SetFit framework
Cosine similarity optimization
Trained with the MultipleNegativesRankingLoss loss function, using cosine similarity as the similarity metric; see the training sketch after this list
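
For readers who want to reproduce a similar setup, the sketch below shows how MultipleNegativesRankingLoss can be configured with cosine similarity in sentence-transformers. The paraphrase pairs and training parameters are placeholders, not the model's actual training configuration.

```python
from sentence_transformers import SentenceTransformer, InputExample, losses, util
from torch.utils.data import DataLoader

# Start from the base model named in the overview.
model = SentenceTransformer("deepset/gbert-large")

# Placeholder paraphrase pairs; the real model was trained on a
# filtered German back-translated paraphrase dataset.
train_examples = [
    InputExample(texts=["Das ist ein Beispiel.", "Dies ist ein Beispiel."]),
    InputExample(texts=["Er fährt nach Berlin.", "Er reist nach Berlin."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# MultipleNegativesRankingLoss with cosine similarity: each paraphrase pair
# is a positive, and all other in-batch pairs act as negatives.
train_loss = losses.MultipleNegativesRankingLoss(model, similarity_fct=util.cos_sim)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
```

In-batch negatives make this loss efficient for paraphrase data, since no explicit negative pairs need to be mined.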

Model Capabilities

German text embedding
Sentence similarity calculation
Few-shot text classification

Use Cases

Text classification
German short text classification
Performing German short text classification in scenarios with limited labeled data
Outperforms comparable models on German few-shot benchmarks
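
A hedged sketch of how this model might be combined with the SetFit framework for few-shot classification follows; the labels, example texts, and training setup are illustrative assumptions, and newer setfit versions may prefer the Trainer API over SetFitTrainer.

```python
from datasets import Dataset
from setfit import SetFitModel, SetFitTrainer

# Load the embedding model as the SetFit body (assumed model id).
model = SetFitModel.from_pretrained("deutsche-telekom/gbert-large-paraphrase-cosine")

# A tiny labeled dataset: 1 = positive, 0 = negative (illustrative).
train_ds = Dataset.from_dict({
    "text": [
        "Der Service war ausgezeichnet.",
        "Das Produkt kam pünktlich an.",
        "Die Lieferung war viel zu spät.",
        "Der Support hat nie geantwortet.",
    ],
    "label": [1, 1, 0, 0],
})

trainer = SetFitTrainer(model=model, train_dataset=train_ds)
trainer.train()

# Predict labels for new German short texts.
preds = model(["Die Qualität ist hervorragend.", "Ich bin sehr enttäuscht."])
print(preds)
```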
Semantic search
German document retrieval
Building a German semantic search engine
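
As an illustration of the retrieval use case, the sketch below indexes a small German corpus and retrieves the closest documents by cosine similarity; the corpus and query are made-up examples.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("deutsche-telekom/gbert-large-paraphrase-cosine")

# Tiny illustrative document collection.
corpus = [
    "Die Hauptstadt von Deutschland ist Berlin.",
    "Kaffee wird aus gerösteten Bohnen hergestellt.",
    "Der Zug nach München fährt um acht Uhr ab.",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

query = "Wann fährt der Zug nach München?"
query_embedding = model.encode(query, convert_to_tensor=True)

# Retrieve the top matches ranked by cosine similarity.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(f"{hit['score']:.3f}  {corpus[hit['corpus_id']]}")
```

For larger collections, the same embeddings can be stored in a vector index instead of compared in memory.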