# ModernPubMedBERT
This is a sentence-transformers model trained on the PubMed dataset. It maps sentences and paragraphs to a dense vector space and, via Matryoshka Representation Learning, supports multiple embedding dimensions (768, 512, 384, 256, 128). This lets you pick the embedding size that fits your application while maintaining strong performance on semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
## Features
- Trained on the PubMed dataset, making it well suited to medical and biomedical applications.
- Uses Matryoshka Representation Learning to support multiple embedding dimensions (see the sketch after this list).
- Strong performance on NLP tasks such as semantic textual similarity and semantic search.
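For example, a smaller embedding size can be selected at load time with the `truncate_dim` argument of `SentenceTransformer`; a minimal sketch (the choice of 256 here is just an illustration):

```python
from sentence_transformers import SentenceTransformer

# Load the model but keep only the first 256 Matryoshka dimensions.
# Any of the trained sizes (768, 512, 384, 256, 128) can be used.
model = SentenceTransformer("lokeshch19/ModernPubMedBERT", truncate_dim=256)

embeddings = model.encode(["The patient was diagnosed with type 2 diabetes mellitus"])
print(embeddings.shape)  # (1, 256)
```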
## Installation
First, install the Sentence Transformers library:
```bash
pip install -U sentence-transformers
```
## Usage Examples

### Basic Usage
```python
from sentence_transformers import SentenceTransformer

# Download and load the model from the Hugging Face Hub
model = SentenceTransformer("lokeshch19/ModernPubMedBERT")

sentences = [
    "The patient was diagnosed with type 2 diabetes mellitus",
    "The individual shows symptoms of hyperglycemia and insulin resistance",
    "Metastatic cancer requires aggressive treatment approaches",
]

# Encode the sentences to dense vectors
embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 768)

# Pairwise cosine similarities between the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)  # (3, 3)
```
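### Semantic Search

Since the model targets semantic search, here is a minimal sketch of ranking a small corpus against a query with `sentence_transformers.util.semantic_search`; the corpus and query are made-up illustrations, not part of the model card:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("lokeshch19/ModernPubMedBERT")

corpus = [
    "Metformin is a first-line medication for type 2 diabetes",
    "Chemotherapy is commonly used for metastatic cancer",
    "Insulin resistance is a hallmark of metabolic syndrome",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

query_embedding = model.encode("treatment options for diabetes", convert_to_tensor=True)

# Retrieve the top 2 most similar corpus entries for the query
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)
for hit in hits[0]:
    print(corpus[hit["corpus_id"]], hit["score"])
```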
## Loss Function

The model was trained with `MatryoshkaLoss` (wrapping `MultipleNegativesRankingLoss`) using the following parameters:

```json
{
    "loss": "MultipleNegativesRankingLoss",
    "matryoshka_dims": [768, 512, 384, 256, 128],
    "matryoshka_weights": [1, 1, 1, 1, 1],
    "n_dims_per_step": -1
}
```
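A minimal sketch of how this loss is typically constructed in Sentence Transformers (the training data and trainer are omitted; this reconstructs the configuration above rather than the exact training script):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

model = SentenceTransformer("lokeshch19/ModernPubMedBERT")

# MultipleNegativesRankingLoss treats the other examples in a batch as negatives
base_loss = MultipleNegativesRankingLoss(model)

# MatryoshkaLoss applies the base loss at each truncated embedding size
loss = MatryoshkaLoss(
    model,
    base_loss,
    matryoshka_dims=[768, 512, 384, 256, 128],
    matryoshka_weights=[1, 1, 1, 1, 1],
    n_dims_per_step=-1,  # use every dimension at each training step
)
```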
## Framework Versions
- Python: 3.10.10
- Sentence Transformers: 4.1.0
- Transformers: 4.51.3
- PyTorch: 2.7.0+cu128
- Accelerate: 1.6.0
- Datasets: 3.5.1
- Tokenizers: 0.21.1
## Documentation

### Model Details
| Property | Details |
|---|---|
| Model Type | Sentence Transformer |
| Maximum Sequence Length | 8192 tokens |
| Output Dimensionality | 768 dimensions |
| Similarity Function | Cosine Similarity |
| Language | en |
| License | apache-2.0 |
### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: ModernBertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```
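Note that pooling uses the CLS token (`pooling_mode_cls_token: True`) rather than mean pooling. The key properties above can also be checked programmatically; a small sketch using standard `SentenceTransformer` attributes:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("lokeshch19/ModernPubMedBERT")

print(model.max_seq_length)                      # 8192
print(model.get_sentence_embedding_dimension())  # 768
print(model)                                     # prints the module list shown above
```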
## License
This project is licensed under the Apache-2.0 license.
## Citation

### BibTeX

#### Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```
#### MatryoshkaLoss

```bibtex
@misc{kusupati2024matryoshka,
    title = {Matryoshka Representation Learning},
    author = {Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year = {2024},
    eprint = {2205.13147},
    archivePrefix = {arXiv},
    primaryClass = {cs.LG}
}
```
#### MultipleNegativesRankingLoss

```bibtex
@misc{henderson2017efficient,
    title = {Efficient Natural Language Response Suggestion for Smart Reply},
    author = {Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year = {2017},
    eprint = {1705.00652},
    archivePrefix = {arXiv},
    primaryClass = {cs.CL}
}
```