sentence-transformers/paraphrase-multilingual-mpnet-base-v2
This is a sentence-transformers model: it maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks such as clustering or semantic search.
Quick Start
Prerequisites
Supported languages: multilingual, ar, bg, ca, cs, da, de, el, en, es, et, fa, fi, fr, gl, gu, he, hi, hr, hu, hy, id, it, ja, ka, ko, ku, lt, lv, mk, mn, mr, ms, my, nb, nl, pl, pt, ro, ru, sk, sl, sq, sr, sv, th, tr, uk, ur, vi. Additional languages in BCP47 format: fr-ca, pt-br, zh-cn, zh-tw. The pipeline tag is sentence-similarity, and the license is apache-2.0.
Installation
Using this model is straightforward once you have sentence-transformers installed:
pip install -U sentence-transformers
Usage
Usage Examples
Basic Usage
from sentence_transformers import SentenceTransformer

# Encode a batch of sentences into 768-dimensional embeddings.
sentences = ["This is an example sentence", "Each sentence is converted"]
model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-mpnet-base-v2')
embeddings = model.encode(sentences)
print(embeddings)
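Because all supported languages share the same vector space, semantic search reduces to comparing embeddings. The sketch below (the corpus and query strings are illustrative, not from the original card) scores a multilingual corpus against an English query with cosine similarity, using the util helpers that ship with sentence-transformers:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-mpnet-base-v2')

# Hypothetical corpus and query; any supported languages can be mixed.
corpus = ["Der Hund spielt im Garten", "The cat sleeps on the sofa", "Le chien joue dehors"]
query = "A dog is playing outside"

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Cosine similarity between the query and every corpus sentence.
scores = util.cos_sim(query_embedding, corpus_embeddings)
for sentence, score in zip(corpus, scores[0]):
    print(f"{score:.4f}  {sentence}")

The same embeddings can be handed to any standard clustering routine (for example scikit-learn's KMeans) for the clustering use case mentioned above.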
Advanced Usage
Without sentence-transformers, you can use the model as follows: first pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.
from transformers import AutoTokenizer, AutoModel
import torch

# Mean pooling: average the token embeddings, using the attention mask so that
# padding tokens do not contribute to the sentence embedding.
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/paraphrase-multilingual-mpnet-base-v2')
model = AutoModel.from_pretrained('sentence-transformers/paraphrase-multilingual-mpnet-base-v2')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling; in this case, mean pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)
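The Transformers-only path returns unnormalized embeddings. If you plan to rank results by dot product, a common follow-up (not part of the snippet above) is to L2-normalize the pooled embeddings so that dot products coincide with cosine similarities. A minimal sketch continuing from sentence_embeddings:

import torch.nn.functional as F

# L2-normalize so the dot product of two embeddings equals their cosine similarity.
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
print(sentence_embeddings @ sentence_embeddings.T)  # pairwise cosine similarities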
Technical Details
Full Model Architecture
SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
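One practical consequence of this architecture: inputs longer than the max_seq_length of 128 word-piece tokens are truncated before pooling. A short sketch, assuming the standard sentence-transformers API, for inspecting and adjusting that limit:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-mpnet-base-v2')
print(model.max_seq_length)  # 128; longer inputs are silently truncated

# The limit can be raised toward the position-embedding capacity of the underlying
# XLM-RoBERTa encoder, but the model was trained on sequences of at most 128 tokens,
# so quality on longer inputs is not guaranteed.
model.max_seq_length = 256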
License
This project is licensed under the apache-2.0 license.
Documentation
Citing & Authors
This model was trained by sentence-transformers.
If you find this model helpful, feel free to cite our publication Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks:
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "http://arxiv.org/abs/1908.10084",
}
Information Table

| Property | Details |
| --- | --- |
| Library Name | sentence-transformers |
| Tags | sentence-transformers, feature-extraction, sentence-similarity, transformers |
| Model Type | A model that maps sentences & paragraphs to a 768-dimensional dense vector space |
| Training Data | Not provided |
| License | apache-2.0 |
| Supported Languages | multilingual, ar, bg, ca, cs, da, de, el, en, es, et, fa, fi, fr, gl, gu, he, hi, hr, hu, hy, id, it, ja, ka, ko, ku, lt, lv, mk, mn, mr, ms, my, nb, nl, pl, pt, ro, ru, sk, sl, sq, sr, sv, th, tr, uk, ur, vi |
| BCP47 Languages | fr-ca, pt-br, zh-cn, zh-tw |
| Pipeline Tag | sentence-similarity |