🚀 Cross English & German RoBERTa for Sentence Embeddings
This model computes sentence (text) embeddings for English and German text, facilitating semantic comparison and cross-lingual search.
🚀 Quick Start
This model is designed to compute sentence (text) embeddings for both English and German text. These embeddings can be compared using cosine similarity to identify sentences with similar semantic meaning. It's useful for semantic textual similarity, semantic search, or paraphrase mining. To use it, you need the Sentence Transformers Python framework.
The model's uniqueness lies in its cross-lingual capabilities. Regardless of the input language, sentences are transformed into semantically similar vectors. For example, you can issue a German query and find relevant results in both German and English. By using an XLM-R base model and multilingual finetuning with language-crossing, it outperforms the best current dedicated English large model (see the Evaluation section below).
Sentence-BERT (SBERT) is a modification of the pretrained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity. This reduces the effort for finding the most similar pair from 65 hours with BERT/RoBERTa to about 5 seconds with SBERT, while maintaining the accuracy of BERT.
Source: Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
This model is fine-tuned by Philip May and open-sourced by [T-Systems-onsite](https://www.t-systems-onsite.de/). Special thanks to [Nils Reimers](https://www.nils-reimers.de/) for the Sentence Transformers, models, and help on GitHub.
✨ Features
- Cross-lingual Capability: Works effectively across English and German, enabling cross-language semantic search.
- High Performance: Outperforms current dedicated English large models through multilingual finetuning with language-crossing.
- Semantic Embeddings: Computes semantically meaningful sentence embeddings that can be compared using cosine similarity.
📦 Installation
To use this model, you need to install the sentence-transformers package. You can find more details here: <https://github.com/UKPLab/sentence-transformers>
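The package can typically be installed from PyPI:

```bash
pip install -U sentence-transformers
```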
💻 Usage Examples
Basic Usage
```python
from sentence_transformers import SentenceTransformer

# Load the cross-lingual English/German sentence embedding model
model = SentenceTransformer('T-Systems-onsite/cross-en-de-roberta-sentence-transformer')
```
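Beyond loading the model, a typical workflow encodes a batch of sentences and compares the embeddings with cosine similarity. The following is a minimal sketch (the sentences are invented examples), assuming a recent sentence-transformers version that provides `util.cos_sim`:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('T-Systems-onsite/cross-en-de-roberta-sentence-transformer')

# A German/English paraphrase pair plus an unrelated sentence
sentences = [
    'This is an example sentence.',
    'Das ist ein Beispielsatz.',
    'The weather report predicted rain.',
]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Pairwise cosine similarities; the cross-lingual paraphrase pair
# should score markedly higher than the unrelated pairs
print(util.cos_sim(embeddings, embeddings))
```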
For more detailed usage and examples, refer to the Sentence Transformers documentation: <https://github.com/UKPLab/sentence-transformers>
📚 Documentation
Training
The base model is [xlm-roberta-base](https://huggingface.co/xlm-roberta-base). It was further trained by [Nils Reimers](https://www.nils-reimers.de/) on a large-scale paraphrase dataset for 50+ languages. [Nils Reimers](https://www.nils-reimers.de/) provided the following details [on GitHub](https://github.com/UKPLab/sentence-transformers/issues/509#issuecomment-712243280):
A paper is upcoming for the paraphrase models.
These models were trained on various datasets with millions of examples for paraphrases, mainly derived from Wikipedia edit logs, paraphrases mined from Wikipedia and SimpleWiki, paraphrases from news reports, AllNLI entailment pairs with in-batch negative loss, etc.
In internal tests, they perform much better than the NLI+STSb models, as they have seen more and broader types of training data. The NLI+STSb data has the issue that it is rather narrow in its domain and does not contain any domain-specific words/sentences (e.g. from chemistry, computer science, math). The paraphrase models have seen plenty of sentences from various domains.
More details with the setup, all the datasets, and a wider evaluation will follow soon.
The resulting model xlm-r-distilroberta-base-paraphrase-v1 was released here: <https://github.com/UKPLab/sentence-transformers/releases/tag/v0.3.8>
Starting from this cross-language model, we fine-tuned it for English and German on the STSbenchmark dataset. For German, we used our [German STSbenchmark dataset](https://github.com/t-systems-on-site-services-gmbh/german-STSbenchmark), translated with deepl.com. We also generated English-German crossed samples, which we call multilingual finetuning with language-crossing. This approach doubled the training data size and improved performance.
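To illustrate the idea, here is a hypothetical sketch (not the original preprocessing code): given STSbenchmark pairs that exist in both an English and a German version with the same similarity score, crossed samples mix the languages within each pair, yielding two additional training pairs per original pair.

```python
# Hypothetical helper: build EN-DE / DE-EN crossed STS samples from
# aligned English and German (sent1, sent2, score) triples.
def make_crossed_samples(en_pairs, de_pairs):
    crossed = []
    for (en1, en2, score), (de1, de2, _) in zip(en_pairs, de_pairs):
        crossed.append((en1, de2, score))  # EN sentence 1 vs. DE sentence 2
        crossed.append((de1, en2, score))  # DE sentence 1 vs. EN sentence 2
    return crossed
```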
We conducted an automatic hyperparameter search over 33 trials with Optuna, using 10-fold cross-validation on the deepl.com test and dev datasets (a sketch of such a search loop follows the list below). The best hyperparameters were:
- batch_size = 8
- num_epochs = 2
- lr = 1.026343323298136e-05
- eps = 4.462251033010287e-06
- weight_decay = 0.04794438776350409
- warmup_steps_proportion = 0.1609010732760181
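For illustration, a search of this shape could be set up with Optuna roughly as follows. `train_and_score` is a hypothetical helper standing in for fine-tuning the model with the sampled hyperparameters and returning the mean cross-validation score; the search ranges are assumptions, not the original configuration.

```python
import optuna

def objective(trial):
    # Sample one hyperparameter configuration per trial
    params = {
        'batch_size': trial.suggest_categorical('batch_size', [8, 16, 32]),
        'num_epochs': trial.suggest_int('num_epochs', 1, 4),
        'lr': trial.suggest_float('lr', 1e-6, 1e-4, log=True),
        'eps': trial.suggest_float('eps', 1e-7, 1e-5, log=True),
        'weight_decay': trial.suggest_float('weight_decay', 0.0, 0.1),
        'warmup_steps_proportion': trial.suggest_float('warmup_steps_proportion', 0.0, 0.3),
    }
    return train_and_score(params)  # hypothetical: fine-tune and return mean CV score

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=33)
print(study.best_params)
```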
The final model was trained with these hyperparameters on the combined train and dev datasets of English, German, and the crossed samples. Only the test set was held out for the final evaluation.
Evaluation
The evaluation was performed on English, German, and cross-language data using the STSbenchmark test data. The evaluation code is available on Colab. We used Spearman's rank correlation between the cosine similarity of sentence embeddings and the STSbenchmark labels as the evaluation metric.
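The metric can be reproduced along the following lines. This is an illustrative sketch, not the original Colab code; the sentence pairs and gold labels below are invented stand-ins for the STSbenchmark test split.

```python
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('T-Systems-onsite/cross-en-de-roberta-sentence-transformer')

# Toy stand-ins; the real evaluation uses the STSbenchmark test split
sentences1 = ['A man is playing a guitar.', 'A woman is cooking.', 'A cat sits on the mat.']
sentences2 = ['Ein Mann spielt Gitarre.', 'Ein Kind schläft.', 'Eine Katze sitzt auf der Matte.']
gold_scores = [5.0, 0.5, 4.8]  # invented labels for illustration only

emb1 = model.encode(sentences1, convert_to_tensor=True)
emb2 = model.encode(sentences2, convert_to_tensor=True)

# Cosine similarity of each aligned pair, then rank correlation with gold labels
cosine_scores = util.cos_sim(emb1, emb2).diagonal().tolist()
spearman, _ = spearmanr(cosine_scores, gold_scores)
print(f'Spearman rank correlation: {spearman:.4f}')
```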
| Model Name | Spearman German | Spearman English | Spearman EN-DE & DE-EN (cross) |
|---|---|---|---|
| xlm-r-distilroberta-base-paraphrase-v1 | 0.8079 | 0.8350 | 0.7983 |
| [xlm-r-100langs-bert-base-nli-stsb-mean-tokens](https://huggingface.co/sentence-transformers/xlm-r-100langs-bert-base-nli-stsb-mean-tokens) | 0.7877 | 0.8465 | 0.7908 |
| xlm-r-bert-base-nli-stsb-mean-tokens | 0.7877 | 0.8465 | 0.7908 |
| [roberta-large-nli-stsb-mean-tokens](https://huggingface.co/sentence-transformers/roberta-large-nli-stsb-mean-tokens) | 0.6371 | 0.8639 | 0.4109 |
| [T-Systems-onsite/german-roberta-sentence-transformer-v2](https://huggingface.co/T-Systems-onsite/german-roberta-sentence-transformer-v2) | 0.8529 | 0.8634 | 0.8415 |
| [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | 0.8355 | 0.8682 | 0.8309 |
| T-Systems-onsite/cross-en-de-roberta-sentence-transformer | 0.8550 | 0.8660 | 0.8525 |
📄 License
Copyright (c) 2020 Philip May, T-Systems on site services GmbH
Licensed under the MIT License (the "License"); you may not use this work except in compliance with the License. You can obtain a copy of the License by reviewing the file [LICENSE](https://huggingface.co/T-Systems-onsite/cross-en-de-roberta-sentence-transformer/blob/main/LICENSE) in the repository.