# ModernPubMedBERT
This is a sentence-transformers model trained on the PubMed dataset. It maps sentences and paragraphs to a dense vector space and, via Matryoshka Representation Learning, supports multiple embedding dimensions (768, 512, 384, 256, 128). This lets you pick the embedding size that fits your application while maintaining strong performance on semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
## Features
- Trained on the PubMed dataset, making it well suited to medical and biomedical applications.
- Uses Matryoshka Representation Learning to support multiple embedding dimensions (see the sketch after this list).
- Strong performance on NLP tasks such as semantic textual similarity and semantic search.
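For example, a smaller embedding size can be selected at load time with the `truncate_dim` argument of `SentenceTransformer`; a minimal sketch (the choice of 256 here is just an illustration):

```python
from sentence_transformers import SentenceTransformer

# Load the model but keep only the first 256 Matryoshka dimensions.
# Any of the trained sizes (768, 512, 384, 256, 128) can be used.
model = SentenceTransformer("lokeshch19/ModernPubMedBERT", truncate_dim=256)

embeddings = model.encode(["The patient was diagnosed with type 2 diabetes mellitus"])
print(embeddings.shape)  # (1, 256)
```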
## Installation
First, install the Sentence Transformers library:
```bash
pip install -U sentence-transformers
```
## Usage Examples

### Basic Usage
```python
from sentence_transformers import SentenceTransformer

# Download and load the model from the Hugging Face Hub
model = SentenceTransformer("lokeshch19/ModernPubMedBERT")

sentences = [
    "The patient was diagnosed with type 2 diabetes mellitus",
    "The individual shows symptoms of hyperglycemia and insulin resistance",
    "Metastatic cancer requires aggressive treatment approaches",
]

# Encode the sentences to dense vectors
embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 768)

# Pairwise cosine similarities between the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)  # (3, 3)
```
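### Semantic Search

Since the model targets semantic search, here is a minimal sketch of ranking a small corpus against a query with `sentence_transformers.util.semantic_search`; the corpus and query are made-up illustrations, not part of the model card:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("lokeshch19/ModernPubMedBERT")

corpus = [
    "Metformin is a first-line medication for type 2 diabetes",
    "Chemotherapy is commonly used for metastatic cancer",
    "Insulin resistance is a hallmark of metabolic syndrome",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

query_embedding = model.encode("treatment options for diabetes", convert_to_tensor=True)

# Retrieve the top 2 most similar corpus entries for the query
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)
for hit in hits[0]:
    print(corpus[hit["corpus_id"]], hit["score"])
```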
## Loss Function

The model was trained with `MatryoshkaLoss` (wrapping `MultipleNegativesRankingLoss`) using the following parameters:

```json
{
    "loss": "MultipleNegativesRankingLoss",
    "matryoshka_dims": [768, 512, 384, 256, 128],
    "matryoshka_weights": [1, 1, 1, 1, 1],
    "n_dims_per_step": -1
}
```
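A minimal sketch of how this loss is typically constructed in Sentence Transformers (the training data and trainer are omitted; this reconstructs the configuration above rather than the exact training script):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

model = SentenceTransformer("lokeshch19/ModernPubMedBERT")

# MultipleNegativesRankingLoss treats the other examples in a batch as negatives
base_loss = MultipleNegativesRankingLoss(model)

# MatryoshkaLoss applies the base loss at each truncated embedding size
loss = MatryoshkaLoss(
    model,
    base_loss,
    matryoshka_dims=[768, 512, 384, 256, 128],
    matryoshka_weights=[1, 1, 1, 1, 1],
    n_dims_per_step=-1,  # use every dimension at each training step
)
```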
## Framework Versions
- Python: 3.10.10
- Sentence Transformers: 4.1.0
- Transformers: 4.51.3
- PyTorch: 2.7.0+cu128
- Accelerate: 1.6.0
- Datasets: 3.5.1
- Tokenizers: 0.21.1
## Documentation

### Model Details
| Property | Details |
|---|---|
| Model Type | Sentence Transformer |
| Maximum Sequence Length | 8192 tokens |
| Output Dimensionality | 768 dimensions |
| Similarity Function | Cosine Similarity |
| Language | en |
| License | apache-2.0 |
### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: ModernBertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```
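Note that pooling uses the CLS token (`pooling_mode_cls_token: True`) rather than mean pooling. The key properties above can also be checked programmatically; a small sketch using standard `SentenceTransformer` attributes:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("lokeshch19/ModernPubMedBERT")

print(model.max_seq_length)                      # 8192
print(model.get_sentence_embedding_dimension())  # 768
print(model)                                     # prints the module list shown above
```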
## License
This project is licensed under the Apache-2.0 license.
## Citation

### BibTeX

#### Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```
#### MatryoshkaLoss

```bibtex
@misc{kusupati2024matryoshka,
    title = {Matryoshka Representation Learning},
    author = {Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year = {2024},
    eprint = {2205.13147},
    archivePrefix = {arXiv},
    primaryClass = {cs.LG}
}
```
#### MultipleNegativesRankingLoss

```bibtex
@misc{henderson2017efficient,
    title = {Efficient Natural Language Response Suggestion for Smart Reply},
    author = {Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year = {2017},
    eprint = {1705.00652},
    archivePrefix = {arXiv},
    primaryClass = {cs.CL}
}
```