🚀 Cross English & German RoBERTa for Sentence Embeddings
This model computes sentence (text) embeddings for English and German text, facilitating semantic comparison and cross-lingual search.
🚀 Quick Start
This model is designed to compute sentence (text) embeddings for both English and German text. These embeddings can be compared using cosine similarity to identify sentences with similar semantic meaning. It's useful for semantic textual similarity, semantic search, or paraphrase mining. To use it, you need the Sentence Transformers Python framework.
The model's uniqueness lies in its cross-lingual capabilities. Regardless of the input language, sentences are transformed into semantically similar vectors. For example, you can issue a German query and find relevant results in both German and English. By using an XLM-R base model and multilingual finetuning with language-crossing, it outperforms the best current dedicated English large model (see the Evaluation section below).
Sentence-BERT (SBERT) is a modification of the pretrained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity. This reduces the effort for finding the most similar pair from 65 hours with BERT/RoBERTa to about 5 seconds with SBERT, while maintaining the accuracy of BERT.
Source: Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
This model is fine-tuned by Philip May and open-sourced by [T-Systems-onsite](https://www.t-systems-onsite.de/). Special thanks to [Nils Reimers](https://www.nils-reimers.de/) for the Sentence Transformers, models, and help on GitHub.
✨ Features
- Cross-lingual Capability: Works effectively across English and German, enabling cross-language semantic search.
- High Performance: Outperforms current dedicated English large models through multilingual finetuning with language-crossing.
- Semantic Embeddings: Computes semantically meaningful sentence embeddings that can be compared using cosine similarity.
📦 Installation
To use this model, you need to install the sentence-transformers package. You can find more details here: <https://github.com/UKPLab/sentence-transformers>
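The package can typically be installed from PyPI:

```bash
pip install -U sentence-transformers
```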
💻 Usage Examples
Basic Usage
```python
from sentence_transformers import SentenceTransformer

# Load the cross-lingual English/German sentence embedding model
model = SentenceTransformer('T-Systems-onsite/cross-en-de-roberta-sentence-transformer')
```
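Beyond loading the model, a typical workflow encodes a batch of sentences and compares the embeddings with cosine similarity. The following is a minimal sketch (the sentences are invented examples), assuming a recent sentence-transformers version that provides `util.cos_sim`:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('T-Systems-onsite/cross-en-de-roberta-sentence-transformer')

# A German/English paraphrase pair plus an unrelated sentence
sentences = [
    'This is an example sentence.',
    'Das ist ein Beispielsatz.',
    'The weather report predicted rain.',
]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Pairwise cosine similarities; the cross-lingual paraphrase pair
# should score markedly higher than the unrelated pairs
print(util.cos_sim(embeddings, embeddings))
```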
For more detailed usage and examples, refer to the Sentence Transformers documentation: <https://github.com/UKPLab/sentence-transformers>
📚 Documentation
Training
The base model is [xlm-roberta-base](https://huggingface.co/xlm-roberta-base). It was further trained by [Nils Reimers](https://www.nils-reimers.de/) on a large-scale paraphrase dataset for 50+ languages. [Nils Reimers](https://www.nils-reimers.de/) provided the following details [on GitHub](https://github.com/UKPLab/sentence-transformers/issues/509#issuecomment-712243280):
A paper is upcoming for the paraphrase models.
These models were trained on various datasets with millions of examples for paraphrases, mainly derived from Wikipedia edit logs, paraphrases mined from Wikipedia and SimpleWiki, paraphrases from news reports, AllNLI entailment pairs with in-batch negative loss, etc.
In internal tests, they perform much better than the NLI+STSb models, as they have seen more and broader types of training data. The NLI+STSb data has the issue that it is rather narrow in its domain and does not contain any domain-specific words/sentences (e.g. from chemistry, computer science, math). The paraphrase models have seen plenty of sentences from various domains.
More details with the setup, all the datasets, and a wider evaluation will follow soon.
The resulting model xlm-r-distilroberta-base-paraphrase-v1 was released here: <https://github.com/UKPLab/sentence-transformers/releases/tag/v0.3.8>
Starting from this cross-language model, we fine-tuned it for English and German on the STSbenchmark dataset. For German, we used our [German STSbenchmark dataset](https://github.com/t-systems-on-site-services-gmbh/german-STSbenchmark), translated with deepl.com. We also generated English-German crossed samples, which we call multilingual finetuning with language-crossing. This approach doubled the training data size and improved performance.
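To illustrate the idea, here is a hypothetical sketch (not the original preprocessing code): given STSbenchmark pairs that exist in both an English and a German version with the same similarity score, crossed samples mix the languages within each pair, yielding two additional training pairs per original pair.

```python
# Hypothetical helper: build EN-DE / DE-EN crossed STS samples from
# aligned English and German (sent1, sent2, score) triples.
def make_crossed_samples(en_pairs, de_pairs):
    crossed = []
    for (en1, en2, score), (de1, de2, _) in zip(en_pairs, de_pairs):
        crossed.append((en1, de2, score))  # EN sentence 1 vs. DE sentence 2
        crossed.append((de1, en2, score))  # DE sentence 1 vs. EN sentence 2
    return crossed
```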
We conducted an automatic hyperparameter search over 33 trials with Optuna, using 10-fold cross-validation on the deepl.com test and dev datasets (a sketch of such a search loop follows the list below). The best hyperparameters were:
- batch_size = 8
- num_epochs = 2
- lr = 1.026343323298136e-05
- eps = 4.462251033010287e-06
- weight_decay = 0.04794438776350409
- warmup_steps_proportion = 0.1609010732760181
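For illustration, a search of this shape could be set up with Optuna roughly as follows. `train_and_score` is a hypothetical helper standing in for fine-tuning the model with the sampled hyperparameters and returning the mean cross-validation score; the search ranges are assumptions, not the original configuration.

```python
import optuna

def objective(trial):
    # Sample one hyperparameter configuration per trial
    params = {
        'batch_size': trial.suggest_categorical('batch_size', [8, 16, 32]),
        'num_epochs': trial.suggest_int('num_epochs', 1, 4),
        'lr': trial.suggest_float('lr', 1e-6, 1e-4, log=True),
        'eps': trial.suggest_float('eps', 1e-7, 1e-5, log=True),
        'weight_decay': trial.suggest_float('weight_decay', 0.0, 0.1),
        'warmup_steps_proportion': trial.suggest_float('warmup_steps_proportion', 0.0, 0.3),
    }
    return train_and_score(params)  # hypothetical: fine-tune and return mean CV score

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=33)
print(study.best_params)
```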
The final model was trained with these hyperparameters on the combined train and dev datasets of English, German, and the crossed samples. Only the test set was held out for the final evaluation.
Evaluation
The evaluation was performed on English, German, and cross-language data using the STSbenchmark test data. The evaluation code is available on Colab. We used Spearman's rank correlation between the cosine similarity of sentence embeddings and the STSbenchmark labels as the evaluation metric.
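The metric can be reproduced along the following lines. This is an illustrative sketch, not the original Colab code; the sentence pairs and gold labels below are invented stand-ins for the STSbenchmark test split.

```python
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('T-Systems-onsite/cross-en-de-roberta-sentence-transformer')

# Toy stand-ins; the real evaluation uses the STSbenchmark test split
sentences1 = ['A man is playing a guitar.', 'A woman is cooking.', 'A cat sits on the mat.']
sentences2 = ['Ein Mann spielt Gitarre.', 'Ein Kind schläft.', 'Eine Katze sitzt auf der Matte.']
gold_scores = [5.0, 0.5, 4.8]  # invented labels for illustration only

emb1 = model.encode(sentences1, convert_to_tensor=True)
emb2 = model.encode(sentences2, convert_to_tensor=True)

# Cosine similarity of each aligned pair, then rank correlation with gold labels
cosine_scores = util.cos_sim(emb1, emb2).diagonal().tolist()
spearman, _ = spearmanr(cosine_scores, gold_scores)
print(f'Spearman rank correlation: {spearman:.4f}')
```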
| Model Name | Spearman German | Spearman English | Spearman EN-DE & DE-EN (cross) |
|---|---|---|---|
| xlm-r-distilroberta-base-paraphrase-v1 | 0.8079 | 0.8350 | 0.7983 |
| [xlm-r-100langs-bert-base-nli-stsb-mean-tokens](https://huggingface.co/sentence-transformers/xlm-r-100langs-bert-base-nli-stsb-mean-tokens) | 0.7877 | 0.8465 | 0.7908 |
| xlm-r-bert-base-nli-stsb-mean-tokens | 0.7877 | 0.8465 | 0.7908 |
| [roberta-large-nli-stsb-mean-tokens](https://huggingface.co/sentence-transformers/roberta-large-nli-stsb-mean-tokens) | 0.6371 | 0.8639 | 0.4109 |
| [T-Systems-onsite/german-roberta-sentence-transformer-v2](https://huggingface.co/T-Systems-onsite/german-roberta-sentence-transformer-v2) | 0.8529 | 0.8634 | 0.8415 |
| [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | 0.8355 | 0.8682 | 0.8309 |
| T-Systems-onsite/cross-en-de-roberta-sentence-transformer | 0.8550 | 0.8660 | 0.8525 |
📄 License
Copyright (c) 2020 Philip May, T-Systems on site services GmbH
Licensed under the MIT License (the "License"); you may not use this work except in compliance with the License. You can obtain a copy of the License by reviewing the file [LICENSE](https://huggingface.co/T-Systems-onsite/cross-en-de-roberta-sentence-transformer/blob/main/LICENSE) in the repository.