🚀 German Semantic V3
This model is designed to create German semantic sentence embeddings. It's the successor of German_Semantic_STS_V2, offering numerous new and exciting features. There are two versions: V3, which is knowledge-heavy, and German_Semantic_V3b, which focuses more on performance.
📋 Metadata
| Property | Details |
|----------|---------|
| Language | German |
| Library Name | sentence-transformers |
| Tags | sentence-transformers, sentence-similarity, feature-extraction, loss:MatryoshkaLoss |
| Base Model | aari1995/gbert-large-2 |
| Metrics | spearman_cosine |
| Pipeline Tag | sentence-similarity |
| License | apache-2.0 |
🎛️ Widget Examples
The model can be tested with the following widget examples:
- Source Sentence: "Bundeskanzler."
- Comparison Sentences: "Angela Merkel.", "Olaf Scholz.", "Tino Chrupalla."
- Source Sentence: "Corona."
- Comparison Sentences: "Virus.", "Krone.", "Bier."
- Source Sentence: "Ein Mann übt Boxen"
- Comparison Sentences: "Ein Affe praktiziert Kampfsportarten.", "Eine Person faltet ein Blatt Papier.", "Eine Frau geht mit ihrem Hund spazieren."
- Source Sentence: "Zwei Frauen laufen."
- Comparison Sentences: "Frauen laufen.", "Die Frau prüft die Augen des Mannes.", "Ein Mann ist auf einem Dach"
- Source Sentence: "Der Mann heißt Joel."
- Comparison Sentences: "Eine Frau namens Jolie", "Ein Mann mit einem englischen Namen.", "Freunde gehen feiern."
✨ Features
- Flexibility: The model is trained with flexible sequence length and embedding truncation. Smaller dimensions cause a minor quality trade-off, but flexibility is a core feature.
- Sequence length: It can embed up to 8192 tokens, 16 times more than V2 and many other models (see the sketch after this list).
- Matryoshka Embeddings: The model is trained for embedding sizes from 1024 down to 64, allowing much smaller embeddings with little quality loss.
- German only: This model is German-only and carries rich cultural knowledge about Germany and German topics. Thanks to its tokenizer, it learns more efficiently, handles shorter queries better, and is more nuanced in many scenarios.
- Updated knowledge and quality data: Based on gbert-large by deepset, with Stage-2 pretraining on 1 billion tokens of German fineweb by occiglot, ensuring up-to-date knowledge.
- Typo and Casing: The model is trained to be robust against minor typos and casing variation, which costs a little benchmark performance during training but makes the embeddings more robust.
- Pooling Function: The model moved from mean pooling to using the CLS token, which generally learns better after Stage-2 pretraining and allows for more flexibility.
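To illustrate the long-context feature, here is a minimal sketch assuming the standard sentence-transformers API; the document and query strings are invented for illustration:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("aari1995/German_Semantic_V3", trust_remote_code=True)
print(model.max_seq_length)  # the card states inputs of up to 8192 tokens

# A long (synthetic) document and a short query, each embedded in a single pass.
long_document = " ".join(["Der Vertrag regelt die Pflichten beider Parteien."] * 400)
query = "Worum geht es in dem Dokument?"

doc_emb = model.encode(long_document, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)

print(model.similarity(query_emb, doc_emb))  # cosine similarity by default
```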
📦 Installation
No specific installation steps are provided in the original document.
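For a typical sentence-transformers setup, something like the following should suffice (an assumption, not taken from the original document):

```bash
pip install -U sentence-transformers
```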
💻 Usage Examples
Basic Usage
```python
from sentence_transformers import SentenceTransformer

# Matryoshka dimension: any size from 64 to 1024; 1024 gives full quality.
matryoshka_dim = 1024
model = SentenceTransformer("aari1995/German_Semantic_V3", trust_remote_code=True, truncate_dim=matryoshka_dim)

sentences = [
    "Eine Flagge weht.",
    "Die Flagge bewegte sich in der Luft.",
    "Zwei Personen beobachten das Wasser.",
]

# Encode to float16 tensors to halve the memory footprint of the embeddings.
embeddings = model.encode(sentences, convert_to_tensor=True).half()

# Pairwise similarities between all sentences (a 3x3 matrix).
similarities = model.similarity(embeddings, embeddings)
```
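With `truncate_dim` set below 1024, `encode` returns embeddings cut to the first `matryoshka_dim` dimensions; `model.similarity` uses cosine similarity by default, so `similarities` is a 3×3 matrix of pairwise scores.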
📚 Documentation
Frequently Asked Questions
Q: Is this model better than V2?
A: In terms of flexibility, definitely. Regarding data, yes, as it is more up-to-date. On benchmarks they differ: V3 is better for longer texts, while V2 works well for shorter texts. Note that many benchmarks do not cover cultural knowledge comprehensively. If you don't need knowledge of developments after early 2020, German_Semantic_V3b is recommended.
Q: What is the difference between V3 and V3b?
A: V3 performs slightly worse on benchmarks, while V3b has a knowledge cutoff of 2020, so the choice depends on your use case. If you want peak performance and recent developments don't matter, choose V3b. If you can sacrifice some benchmark points and want the model to know about events from 2020 on (elections, COVID, other cultural events, etc.), use this one. Another difference: V3 has a broader cosine similarity spectrum (-1 to 1, mostly above -0.2), while V3b is more aligned with V2, with a similarity spectrum of roughly 0 to 1. Also, V3 uses cls_pooling while V3b uses mean_pooling.
Q: How does the model perform vs. multilingual models?
A: There are great multilingual models that are useful for many use cases. This model stands out with its cultural knowledge about German people and behavior.
Q: What is the trade-off when reducing the embedding size?
A: Generally, reducing from 1024 to 512 dimensions costs very little (about 1 percent); going all the way down to 64 dimensions may cost up to 3 percent, as sketched below.
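A minimal sketch of how this trade-off can be inspected, assuming the standard sentence-transformers and PyTorch APIs; the example sentences are invented:

```python
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("aari1995/German_Semantic_V3", trust_remote_code=True)

sentences = ["Eine Flagge weht.", "Die Flagge bewegte sich in der Luft."]
full = model.encode(sentences, convert_to_tensor=True)  # full 1024-dim embeddings

for dim in (1024, 512, 256, 64):
    # Matryoshka truncation: keep the first `dim` dimensions, then re-normalize.
    truncated = F.normalize(full[:, :dim], p=2, dim=1)
    cos = (truncated[0] @ truncated[1]).item()
    print(f"dim={dim:4d}  cosine similarity={cos:.4f}")
```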
🔧 Technical Details
No specific technical details are provided in the original document.
📄 License
This model is licensed under the Apache 2.0 license.
📈 Evaluation
- Storage comparison: see the figure in the original model card.
- Benchmarks: Coming soon.
⏭️ Up Next
German_Semantic_V3_Instruct, which aims to guide embeddings towards self-selected aspects, is planned for 2024.
🙏 Thank You and Credits
- Thanks to jinaAI for their BERT implementation, especially ALiBi.
- Thanks to deepset for the gbert-large model.
- Thanks to occiglot and OSCAR for the data used in pre-training.
- Thanks to Tom for sentence-transformers and feedback, and Björn and Jan from ellamind for consultation.
- Thanks to Meta for XNLI, which is used in variations.
The idea, training, and implementation of this model are by Aaron Chibb.