🚀 German Semantic V3
This model is designed to create German semantic sentence embeddings. It's the successor of German_Semantic_STS_V2, offering numerous new and exciting features. There are two versions: V3, which is knowledge-heavy, and German_Semantic_V3b, which focuses more on performance.
📋 Metadata
| Property | Details |
|----------|---------|
| Language | German |
| Library Name | sentence-transformers |
| Tags | sentence-transformers, sentence-similarity, feature-extraction, loss:MatryoshkaLoss |
| Base Model | aari1995/gbert-large-2 |
| Metrics | spearman_cosine |
| Pipeline Tag | sentence-similarity |
| License | apache-2.0 |
🎛️ Widget Examples
The model can be tested with the following widget examples:
- Source Sentence: "Bundeskanzler."
- Comparison Sentences: "Angela Merkel.", "Olaf Scholz.", "Tino Chrupalla."
- Source Sentence: "Corona."
- Comparison Sentences: "Virus.", "Krone.", "Bier."
- Source Sentence: "Ein Mann übt Boxen"
- Comparison Sentences: "Ein Affe praktiziert Kampfsportarten.", "Eine Person faltet ein Blatt Papier.", "Eine Frau geht mit ihrem Hund spazieren."
- Source Sentence: "Zwei Frauen laufen."
- Comparison Sentences: "Frauen laufen.", "Die Frau prüft die Augen des Mannes.", "Ein Mann ist auf einem Dach"
- Source Sentence: "Der Mann heißt Joel."
- Comparison Sentences: "Eine Frau namens Jolie", "Ein Mann mit einem englischen Namen.", "Freunde gehen feiern."
✨ Features
- Flexibility: The model is trained with flexible sequence length and embedding truncation. Smaller dimensions cause a minor quality trade-off, but flexibility is a core feature.
- Sequence length: It can embed up to 8192 tokens, 16 times more than V2 and many other models (see the sketch after this list).
- Matryoshka Embeddings: The model is trained for embedding sizes from 1024 down to 64, allowing much smaller embeddings with little quality loss.
- German only: This model is German-only and carries rich cultural knowledge about Germany and German topics. Thanks to its tokenizer, it learns more efficiently, handles shorter queries better, and is more nuanced in many scenarios.
- Updated knowledge and quality data: Based on gbert-large by deepset, with Stage-2 pretraining on 1 billion tokens of German fineweb by occiglot, ensuring up-to-date knowledge.
- Typo and Casing: The model is trained to be robust against minor typos and casing variation, which costs a little benchmark performance during training but makes the embeddings more robust.
- Pooling Function: The model moved from mean pooling to using the CLS token, which generally learns better after Stage-2 pretraining and allows for more flexibility.
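To illustrate the long-context feature, here is a minimal sketch assuming the standard sentence-transformers API; the document and query strings are invented for illustration:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("aari1995/German_Semantic_V3", trust_remote_code=True)
print(model.max_seq_length)  # the card states inputs of up to 8192 tokens

# A long (synthetic) document and a short query, each embedded in a single pass.
long_document = " ".join(["Der Vertrag regelt die Pflichten beider Parteien."] * 400)
query = "Worum geht es in dem Dokument?"

doc_emb = model.encode(long_document, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)

print(model.similarity(query_emb, doc_emb))  # cosine similarity by default
```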
📦 Installation
No specific installation steps are provided in the original document.
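For a typical sentence-transformers setup, something like the following should suffice (an assumption, not taken from the original document):

```bash
pip install -U sentence-transformers
```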
💻 Usage Examples
Basic Usage
```python
from sentence_transformers import SentenceTransformer

# Matryoshka dimension: any size from 64 to 1024; 1024 gives full quality.
matryoshka_dim = 1024
model = SentenceTransformer("aari1995/German_Semantic_V3", trust_remote_code=True, truncate_dim=matryoshka_dim)

sentences = [
    "Eine Flagge weht.",
    "Die Flagge bewegte sich in der Luft.",
    "Zwei Personen beobachten das Wasser.",
]

# Encode to float16 tensors to halve the memory footprint of the embeddings.
embeddings = model.encode(sentences, convert_to_tensor=True).half()

# Pairwise similarities between all sentences (a 3x3 matrix).
similarities = model.similarity(embeddings, embeddings)
```
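With `truncate_dim` set below 1024, `encode` returns embeddings cut to the first `matryoshka_dim` dimensions; `model.similarity` uses cosine similarity by default, so `similarities` is a 3×3 matrix of pairwise scores.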
📚 Documentation
Frequently Asked Questions
Q: Is this model better than V2?
A: In terms of flexibility, definitely. Regarding data, yes, as it is more up-to-date. On benchmarks they differ: V3 is better for longer texts, while V2 works well for shorter texts. Note that many benchmarks do not cover cultural knowledge comprehensively. If you don't need knowledge of developments after early 2020, German_Semantic_V3b is recommended.
Q: What is the difference between V3 and V3b?
A: V3 performs slightly worse on benchmarks, while V3b has a knowledge cutoff of 2020, so the choice depends on your use case. If you want peak performance and recent developments don't matter, choose V3b. If you can sacrifice some benchmark points and want the model to know about events from 2020 on (elections, COVID, other cultural events, etc.), use this one. Another difference: V3 has a broader cosine similarity spectrum (-1 to 1, mostly above -0.2), while V3b is more aligned with V2, with a similarity spectrum of roughly 0 to 1. Also, V3 uses cls_pooling while V3b uses mean_pooling.
Q: How does the model perform vs. multilingual models?
A: There are great multilingual models that are useful for many use cases. This model stands out with its cultural knowledge about German people and behavior.
Q: What is the trade-off when reducing the embedding size?
A: Generally, reducing from 1024 to 512 dimensions costs very little (about 1 percent); going all the way down to 64 dimensions may cost up to 3 percent, as sketched below.
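A minimal sketch of how this trade-off can be inspected, assuming the standard sentence-transformers and PyTorch APIs; the example sentences are invented:

```python
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("aari1995/German_Semantic_V3", trust_remote_code=True)

sentences = ["Eine Flagge weht.", "Die Flagge bewegte sich in der Luft."]
full = model.encode(sentences, convert_to_tensor=True)  # full 1024-dim embeddings

for dim in (1024, 512, 256, 64):
    # Matryoshka truncation: keep the first `dim` dimensions, then re-normalize.
    truncated = F.normalize(full[:, :dim], p=2, dim=1)
    cos = (truncated[0] @ truncated[1]).item()
    print(f"dim={dim:4d}  cosine similarity={cos:.4f}")
```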
🔧 Technical Details
No specific technical details are provided in the original document.
📄 License
This model is licensed under the Apache 2.0 license.
📈 Evaluation
- Storage comparison: see the figure in the original model card.
- Benchmarks: Coming soon.
⏭️ Up Next
German_Semantic_V3_Instruct, which aims to guide embeddings towards self-selected aspects, is planned for 2024.
🙏 Thank You and Credits
- Thanks to jinaAI for their BERT implementation, especially ALiBi.
- Thanks to deepset for the gbert-large model.
- Thanks to occiglot and OSCAR for the data used in pre-training.
- Thanks to Tom for sentence-transformers and feedback, and Björn and Jan from ellamind for consultation.
- Thanks to Meta for XNLI, which is used in variations.
The idea, training, and implementation of this model are by Aaron Chibb.