🚀 airnicco8/xlm-roberta-de
This is a sentence-transformers model: it maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks such as clustering or semantic search. The model was trained on TED Talk transcripts filtered for German; the training setting is detailed here. It can be used directly for sentence similarity and can also be fine-tuned for NLI and text classification. Examples will be provided soon.
✨ Features
- Maps sentences and paragraphs to a 768-dimensional dense vector space.
- Can be used for clustering and semantic search (see the sketch after this list).
- Applicable to sentence-similarity tasks and can be fine-tuned for NLI and text classification.
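For instance, a minimal semantic-search sketch (the corpus and query below are illustrative, not from the training data):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('airnicco8/xlm-roberta-de')

# Illustrative German corpus and query
corpus = ["Ein Mann spielt Gitarre.", "Eine Katze schläft auf dem Sofa."]
query = "Jemand macht Musik."

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Rank corpus sentences by cosine similarity to the query
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)
print(hits)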
📦 Installation
Using this model is straightforward once you have sentence-transformers installed:
pip install -U sentence-transformers
💻 Usage Examples
Basic Usage
If you have sentence-transformers installed:
from sentence_transformers import SentenceTransformer
sentences = ["das ist eine glückliche Frau", "das ist ein glücklicher Mann", "das ist ein glücklicher Hund"]
model = SentenceTransformer('airnicco8/xlm-roberta-de')
embeddings = model.encode(sentences)
print(embeddings)
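To compare the resulting embeddings, the cosine-similarity helper bundled with sentence-transformers can be applied (a small illustrative addition, not part of the original snippet):

from sentence_transformers import util

# Pairwise cosine similarities between the three example sentences
similarities = util.cos_sim(embeddings, embeddings)
print(similarities)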
Advanced Usage
Without sentence-transformers, you can use the model like this: first, pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.
from transformers import AutoTokenizer, AutoModel
import torch
def mean_pooling(model_output, attention_mask):
    # Mean pooling: average token embeddings, taking the attention mask into account
    token_embeddings = model_output[0]  # first element contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
sentences = ["das ist eine glückliche Frau", "das ist ein glücklicher Mann", "das ist ein glücklicher Hund"]
tokenizer = AutoTokenizer.from_pretrained('airnicco8/xlm-roberta-de')
model = AutoModel.from_pretrained('airnicco8/xlm-roberta-de')
# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Apply mean pooling to get sentence embeddings
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)
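The mean-pooled vectors are not length-normalized; if you want dot products to equal cosine similarities, normalize them first (standard PyTorch, not part of the original snippet):

import torch.nn.functional as F

# L2-normalize so dot products equal cosine similarities
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)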
📚 Documentation
Evaluation Results
For an automated evaluation of this model, see the Sentence Embeddings Benchmark: https://seb.sbert.net
Training
The model was trained with the following parameters:
DataLoader:
torch.utils.data.dataloader.DataLoader of length 3071 with parameters:
{'batch_size': 16, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
Loss:
sentence_transformers.losses.MSELoss.MSELoss
Parameters of the fit()-Method:
{
"epochs": 10,
"evaluation_steps": 1000,
"evaluator": "sentence_transformers.evaluation.SequentialEvaluator.SequentialEvaluator",
"max_grad_norm": 1,
"optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
"optimizer_params": {
"eps": 1e-06,
"lr": 2e-05
},
"scheduler": "WarmupLinear",
"steps_per_epoch": null,
"warmup_steps": 1000,
"weight_decay": 0.01
}
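Put together, a training call along these lines would reproduce the configuration above (a sketch only; the train_examples are placeholders, and with MSELoss the label is assumed to be a target embedding vector, e.g. from a teacher model):

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('airnicco8/xlm-roberta-de')

# Placeholder example: a zero vector stands in for a teacher embedding
train_examples = [InputExample(texts=["das ist eine glückliche Frau"], label=[0.0] * 768)]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MSELoss(model=model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=10,
    evaluation_steps=1000,
    scheduler='WarmupLinear',
    warmup_steps=1000,
    optimizer_params={'lr': 2e-5, 'eps': 1e-6},
    weight_decay=0.01,
    max_grad_norm=1,
)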
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
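The same architecture can also be assembled explicitly from sentence-transformers modules (a sketch equivalent to what SentenceTransformer('airnicco8/xlm-roberta-de') loads):

from sentence_transformers import SentenceTransformer, models

word_embedding_model = models.Transformer('airnicco8/xlm-roberta-de', max_seq_length=128)
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),  # 768
    pooling_mode_mean_tokens=True,
)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])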
Citing & Authors
| Property | Details |
|----------|---------|
| Pipeline Tag | sentence-similarity |
| Tags | sentence-transformers, feature-extraction, sentence-similarity, transformers, german, nli, text-classification |