MiniLM-L6-danish-encoder Open-source Model - Lightweight Processing for Danish Text Tasks

Developed by KennethTM

This is a lightweight Danish sentence embedding model, adapted from the English MiniLM model and suited to Danish text processing tasks.

Text Embedding | Other | Open Source License: MIT | #Danish sentence vector #Lightweight encoder #Semantic search optimization

Downloads 5,802

Release Time: 1/9/2024

Model Overview

This model maps Danish sentences and paragraphs to a 384-dimensional vector space, supporting tasks such as clustering and semantic search. It is adapted from the English MiniLM model (sentence-transformers/all-MiniLM-L6-v2), uses a Danish tokenizer, and is trained on data machine-translated from English to Danish.

Model Features

Lightweight design

Approximately 22 million parameters, so computational resource requirements are low

Danish optimization

Uses a Danish tokenizer, making it suitable for Danish text processing

Long text support

Supports a maximum sequence length of 512 tokens

Transfer learning

Adapted from the English MiniLM model rather than trained from scratch

Model Capabilities

Text embedding

Sentence similarity calculation

Semantic search

Text clustering

Use Cases

Information retrieval

Danish semantic search

Build a Danish search engine that matches on meaning rather than keywords

Returns results that are semantically close to the query, even when they share no keywords with it
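As a rough illustration of how such a search could rank results, the sketch below scores documents against a query by cosine similarity and returns the best matches. It is a minimal pure-Python sketch, not the model author's implementation: the helper names `cosine_similarity` and `rank_documents` are hypothetical, and the 3-dimensional vectors are stand-ins for the 384-dimensional embeddings the model would produce.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs, top_k=3):
    """Return (index, score) pairs for the top_k most similar documents."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Stand-in vectors (assumption): real embeddings from the model have 384 dimensions.
doc_vecs = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.3, 0.1],
]
query_vec = [1.0, 0.0, 0.0]

results = rank_documents(query_vec, doc_vecs, top_k=2)
for index, score in results:
    print(index, round(score, 3))
```

In practice the query and document vectors would come from the model, and a vectorized library would normally replace the pure-Python loops for larger collections.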

Text analysis

Danish text clustering

Automatically group Danish documents or user comments

Discover similar content or themes
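One simple way to group embedded texts is sketched below. It is a greedy threshold grouping rather than a full clustering algorithm such as k-means, and it is only an assumption about how grouping might be done: the function `group_by_similarity`, the threshold of 0.8, and the small stand-in vectors are all illustrative and not part of the model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_by_similarity(vectors, threshold=0.8):
    """Greedy grouping: each vector joins the first group whose seed
    (first member) it is similar enough to, otherwise it starts a new group."""
    groups = []  # Each group is a list of indices; index 0 of the list is the seed
    for i, vec in enumerate(vectors):
        for group in groups:
            if cosine_similarity(vec, vectors[group[0]]) >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])  # No existing group was close enough
    return groups


# Stand-in vectors (assumption): real embeddings from the model have 384 dimensions.
vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.1, 0.95, 0.0],
    [0.0, 0.0, 1.0],
]

groups = group_by_similarity(vectors, threshold=0.8)
print(groups)  # [[0, 1], [2, 3], [4]]
```

The result depends on input order and on the chosen threshold, so a dedicated clustering algorithm may be a better fit for larger or noisier collections.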

🚀 MiniLM-L6-danish-encoder

This is a lightweight sentence-transformers model for Danish NLP, which maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for tasks such as clustering or semantic search.

⚠️ Important Note

A new version is available, trained on more data and otherwise identical: KennethTM/MiniLM-L6-danish-encoder-v2

Property	Details
Pipeline Tag	sentence-similarity
Tags	sentence-transformers, feature-extraction, sentence-similarity
License	mit
Datasets	squad, eli5, sentence-transformers/embedding-training-data
Language	da
Library Name	sentence-transformers

🚀 Quick Start

This is a lightweight (~22M parameters) sentence-transformers model for Danish NLP. It maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for tasks like clustering or semantic search. The maximum sequence length is 512 tokens.

The model was not pre-trained from scratch but adapted from the English version of sentence-transformers/all-MiniLM-L6-v2 with a Danish tokenizer. It was trained on ELI5 and SQuAD data machine-translated from English to Danish.

📦 Installation

Using this model becomes easy when you have sentence-transformers installed:

pip install -U sentence-transformers

💻 Usage Examples

Basic Usage

If you have sentence-transformers installed, you can use the model like this:

from sentence_transformers import SentenceTransformer
sentences = ["Kører der cykler på vejen?", "En panda løber på vejen.", "En mand kører hurtigt forbi på cykel."]

model = SentenceTransformer('KennethTM/MiniLM-L6-danish-encoder')
embeddings = model.encode(sentences)
print(embeddings)

Advanced Usage

Without sentence-transformers, you can use the model through the transformers library: first pass your input through the transformer model, then apply the right pooling operation (here, mean pooling) on top of the contextualized word embeddings.

from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Mean pooling: average the token embeddings, using the attention mask so padding tokens are ignored
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    # Sum over the token dimension, then divide by the number of real tokens (clamped to avoid division by zero)
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Sentences we want sentence embeddings for
sentences = ["Kører der cykler på vejen?", "En panda løber på vejen.", "En mand kører hurtigt forbi på cykel."]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('KennethTM/MiniLM-L6-danish-encoder')
model = AutoModel.from_pretrained('KennethTM/MiniLM-L6-danish-encoder')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

# Normalize embeddings
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)

print("Sentence embeddings:")
print(sentence_embeddings)
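To make the pooling and normalization steps above concrete, here is a small worked example in plain Python, with no model or tensors involved. It is only a sketch of the same arithmetic: `masked_mean_pool` and `l2_normalize` are hypothetical helper names, and the 2-dimensional token vectors are made-up stand-ins for real token embeddings.

```python
import math


def masked_mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask value is 1; padding (mask 0) is ignored."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for j in range(dim):
                total[j] += vec[j]
    count = max(count, 1e-9)  # Same guard against division by zero as the clamp above
    return [t / count for t in total]


def l2_normalize(vec):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


# Made-up token vectors (assumption); the last one represents a padding token.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean_pool(tokens, mask)
unit = l2_normalize(pooled)

print(pooled)  # [2.0, 3.0] (the padding vector does not affect the mean)
print(unit)
```

Because the padding token is masked out, the large values in its vector have no effect on the pooled result. After normalization, the dot product of two embeddings equals their cosine similarity.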

📄 License

This project is licensed under the MIT license.
