🚀 FremyCompany/BioLORD-STAMB2-v1
This model was trained using BioLORD, a new pre-training strategy. It generates meaningful representations for clinical sentences and biomedical concepts, offering high-quality text-similarity performance in both domains.
🚀 Quick Start
This model was introduced in 2022, and a new version has been released since then. For most use cases, BioLORD-2023, our latest generation of BioLORD models, is recommended.
State-of-the-art methods maximize the representation similarity of names referring to the same concept and prevent collapse through contrastive learning. However, because biomedical names are not always self-explanatory, this can produce non-semantic representations. BioLORD addresses the issue by grounding concept representations with definitions and short descriptions drawn from a multi-relational knowledge graph of biomedical ontologies. The resulting concept representations are more semantic and match the hierarchical structure of the ontologies more closely. BioLORD sets a new state of the art for text similarity on both clinical sentences (MedSTS) and biomedical concepts (MayoSRS).
This model is based on sentence-transformers/all-mpnet-base-v2 and is further fine-tuned on the BioLORD-Dataset.
✨ Features
General Purpose
This is a sentence-transformers model. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search. Although finetuned for the biomedical domain, it retains the ability to embed general-purpose text, but it is best suited to processing medical documents such as electronic health records (EHRs) or clinical notes. Both sentences and phrases can be embedded in the same latent space.
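As a quick illustration of that shared latent space, here is a minimal semantic-search sketch using the util.semantic_search helper from sentence-transformers (the corpus sentences are made up for this example; installation instructions follow below):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('FremyCompany/BioLORD-STAMB2-v1')

# Hypothetical clinical sentences serving as a tiny corpus
corpus = [
    "Patient presents with fever and axillary lymphadenopathy after a cat scratch.",
    "Patient reports chronic lower back pain radiating to the left leg.",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

# A short concept name queried against full sentences in the same latent space
query_embedding = model.encode("Bartonellosis", convert_to_tensor=True)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)
print(hits)  # corpus entries ranked by cosine similarity to the query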
📦 Installation
Using this model is easy once you have sentence-transformers installed:
pip install -U sentence-transformers
💻 Usage Examples
Basic Usage (Sentence-Transformers)
from sentence_transformers import SentenceTransformer
sentences = ["Cat scratch injury", "Cat scratch disease", "Bartonellosis"]
model = SentenceTransformer('FremyCompany/BioLORD-STAMB2-v1')
embeddings = model.encode(sentences)
print(embeddings)
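To compare the resulting embeddings, you can compute pairwise cosine similarities with the util.cos_sim helper. A minimal, self-contained sketch (the expected ranking is illustrative, not a reported benchmark result):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('FremyCompany/BioLORD-STAMB2-v1')
embeddings = model.encode(["Cat scratch injury", "Cat scratch disease", "Bartonellosis"])

# 3x3 matrix of pairwise cosine similarities; "Cat scratch disease" should land
# closer to its synonym "Bartonellosis" than to the lexically similar "Cat scratch injury"
print(util.cos_sim(embeddings, embeddings))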
Advanced Usage (HuggingFace Transformers)
Without sentence-transformers, you can use the model as follows: first pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Mean pooling: take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Sentences we want sentence embeddings for
sentences = ["Cat scratch injury", "Cat scratch disease", "Bartonellosis"]

# Load the model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained('FremyCompany/BioLORD-STAMB2-v1')
model = AutoModel.from_pretrained('FremyCompany/BioLORD-STAMB2-v1')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling and normalize the embeddings
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)

print("Sentence embeddings:")
print(sentence_embeddings)
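Since the embeddings above are L2-normalized, cosine similarities reduce to a plain matrix product. A minimal follow-up, assuming the variables from the snippet above are still in scope:

# Pairwise cosine similarities (valid because sentence_embeddings is L2-normalized)
similarity_matrix = sentence_embeddings @ sentence_embeddings.T
print(similarity_matrix)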
📚 Documentation
Citation
This model accompanies the BioLORD: Learning Ontological Representations from Definitions paper, accepted at Findings of EMNLP 2022. When you use this model, please cite the original paper as follows:
@inproceedings{remy-etal-2022-biolord,
    title = "{B}io{LORD}: Learning Ontological Representations from Definitions for Biomedical Concepts and their Textual Descriptions",
    author = "Remy, François and
      Demuynck, Kris and
      Demeester, Thomas",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.findings-emnlp.104",
    pages = "1454--1465",
    abstract = "This work introduces BioLORD, a new pre-training strategy for producing meaningful representations for clinical sentences and biomedical concepts. State-of-the-art methodologies operate by maximizing the similarity in representation of names referring to the same concept, and preventing collapse through contrastive learning. However, because biomedical names are not always self-explanatory, it sometimes results in non-semantic representations. BioLORD overcomes this issue by grounding its concept representations using definitions, as well as short descriptions derived from a multi-relational knowledge graph consisting of biomedical ontologies. Thanks to this grounding, our model produces more semantic concept representations that match more closely the hierarchical structure of ontologies. BioLORD establishes a new state of the art for text similarity on both clinical sentences (MedSTS) and biomedical concepts (MayoSRS).",
}
You might also want to take a look at our MWE 2023 paper.
📄 License
My own contributions for this model are covered by the MIT license. However, given that the data used to train this model originates from UMLS, you will need to ensure you have proper licensing of UMLS before using this model. UMLS is free of charge in most countries, but you might have to create an account and report your usage of the data yearly to keep a valid license.
📄 Information Table
| Property | Details |
|----------|---------|
| Model Type | A sentence-transformers model fine-tuned for the biomedical domain |
| Training Data | BioLORD-Dataset |