FremyCompany/BioLORD-2023
This model, trained with BioLORD, offers meaningful representations for clinical sentences and biomedical concepts, achieving state-of-the-art results in text similarity.
Quick Start
This model is designed to map sentences and paragraphs to a 768-dimensional dense vector space, suitable for tasks such as clustering or semantic search, especially in the biomedical domain.
Features
- Innovative Training Strategy: BioLORD uses a new pre-training strategy that grounds concept representations with definitions and short descriptions derived from a multi-relational knowledge graph of biomedical ontologies. This approach produces more semantic concept representations that match the hierarchical structure of the ontologies.
- Domain-Specific Fine-Tuning: Fine-tuned on biomedical datasets, the model is well suited for processing medical documents such as EHR records and clinical notes.
- Multiple Model Variants: Part of the BioLORD-2023 series, which includes multilingual, distilled, and contrastive-trained models.
Installation
If you want to use this model with sentence-transformers, you need to install it first:

```bash
pip install -U sentence-transformers
```
Usage Examples
Basic Usage (Sentence-Transformers)
```python
from sentence_transformers import SentenceTransformer

sentences = ["Cat scratch injury", "Cat scratch disease", "Bartonellosis"]

# Load the model and compute one 768-dimensional embedding per sentence.
model = SentenceTransformer('FremyCompany/BioLORD-2023')
embeddings = model.encode(sentences)
print(embeddings)
```
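The embeddings can be compared directly for semantic textual similarity or semantic search. The following sketch uses the `util.cos_sim` helper from sentence-transformers to rank the example phrases against each other; it is an illustrative addition to the basic example, and the exact scores depend on the checkpoint you load.

```python
from sentence_transformers import SentenceTransformer, util

# Example phrases: two names for the same disease plus a related injury.
sentences = ["Cat scratch injury", "Cat scratch disease", "Bartonellosis"]

model = SentenceTransformer('FremyCompany/BioLORD-2023')
embeddings = model.encode(sentences, convert_to_tensor=True)

# Pairwise cosine similarities; higher values indicate closer biomedical meaning.
similarities = util.cos_sim(embeddings, embeddings)
print(similarities)
```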
Advanced Usage (HuggingFace Transformers)
```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Mean pooling: average the token embeddings, taking the attention mask
# into account so that padding tokens are ignored.
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Sentences we want embeddings for
sentences = ["Cat scratch injury", "Cat scratch disease", "Bartonellosis"]

# Load the model and tokenizer from the HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('FremyCompany/BioLORD-2023')
model = AutoModel.from_pretrained('FremyCompany/BioLORD-2023')

# Tokenize the sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling and normalize the embeddings to unit length
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)

print("Sentence embeddings:")
print(sentence_embeddings)
```
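Since the pooled embeddings are normalized to unit length in the snippet above, cosine similarities between the three sentences can be obtained with a plain matrix product. This short continuation is an illustrative addition, not part of the original example:

```python
# Continuing from the snippet above: for unit-length embeddings,
# the dot product equals the cosine similarity.
similarity_matrix = sentence_embeddings @ sentence_embeddings.T
print(similarity_matrix)
```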
Documentation
Sibling models
This model is part of the BioLORD-2023 series. You might also be interested in the following models:
- [BioLORD-2023-M](https://huggingface.co/FremyCompany/BioLORD-2023-M) (multilingual model; distilled from BioLORD-2023)
- [BioLORD-2023](https://huggingface.co/FremyCompany/BioLORD-2023) (best model after model averaging; this model)
- [BioLORD-2023-S](https://huggingface.co/FremyCompany/BioLORD-2023-S) (best hyperparameters; no model averaging)
- [BioLORD-2023-C](https://huggingface.co/FremyCompany/BioLORD-2023-C) (contrastive training only; for NEL tasks; see the retrieval sketch below)
You can also refer to last year's model and paper:
- [BioLORD-2022](https://huggingface.co/FremyCompany/BioLORD-STAMB2-v1) (also known as BioLORD-STAMB2-v1)
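As an illustration of the NEL (named entity linking) use case mentioned for BioLORD-2023-C above, the hypothetical sketch below links a free-text mention to the closest concept name from a small candidate list by nearest-neighbor search over embeddings; the mention and candidate list are invented for the example, and a real setup would draw candidates from an ontology such as SNOMED CT.

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical candidate concept names and mention, invented for this example.
candidates = ["Cat scratch disease", "Cat scratch injury", "Lyme disease"]
mention = "bartonella henselae infection after a cat scratch"

model = SentenceTransformer('FremyCompany/BioLORD-2023-C')
candidate_embeddings = model.encode(candidates, convert_to_tensor=True)
mention_embedding = model.encode(mention, convert_to_tensor=True)

# Pick the candidate whose embedding is closest to the mention.
scores = util.cos_sim(mention_embedding, candidate_embeddings)[0]
print(candidates[int(scores.argmax())])
```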
Training strategy
Summary of the 3 phases
Contrastive phase: details
Self-distillation phase: details
Technical Details
State-of-the-art methodologies often maximize the similarity between representations of names referring to the same concept, and prevent collapse through contrastive learning. However, because biomedical names are not always self-explanatory, this sometimes results in non-semantic representations. BioLORD overcomes this issue by grounding its concept representations with definitions and short descriptions derived from a multi-relational knowledge graph of biomedical ontologies, which leads to more semantic concept representations that match the hierarchical structure of the ontologies.
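To make the general idea of contrastive grounding concrete, the sketch below implements a plain InfoNCE-style objective that pulls a concept name's embedding towards the embedding of its definition while pushing it away from the other definitions in the batch. This is an illustrative approximation of this family of methods, not the BioLORD training code; the batch construction and temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(name_embeddings, definition_embeddings, temperature=0.05):
    """InfoNCE-style loss: the i-th name should match the i-th definition.

    Both inputs have shape (batch, dim), e.g. pooled sentence-encoder outputs.
    The temperature of 0.05 is an assumed value, not the one used for BioLORD.
    """
    # Normalize so that dot products are cosine similarities.
    names = F.normalize(name_embeddings, p=2, dim=1)
    defs = F.normalize(definition_embeddings, p=2, dim=1)

    # (batch, batch) similarity matrix; diagonal entries are the positive pairs.
    logits = names @ defs.T / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: name -> definition and definition -> name.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
```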
License
The author's own contributions for this model are covered by the MIT license. However, since the training data comes from UMLS and SnomedCT, you need to ensure you have proper licensing of UMLS and SnomedCT before using this model. Both UMLS and SnomedCT are free of charge in most countries, but you may need to create an account and report on your usage of the data yearly to maintain a valid license.
Citation
This model accompanies the BioLORD-2023: Learning Ontological Representations from Definitions paper. When using this model, please cite the original paper as follows:
```bibtex
@article{remy-etal-2023-biolord,
    author  = {Remy, François and Demuynck, Kris and Demeester, Thomas},
    title   = "{BioLORD-2023: semantic textual representations fusing large language models and clinical knowledge graph insights}",
    journal = {Journal of the American Medical Informatics Association},
    pages   = {ocae029},
    year    = {2024},
    month   = {02},
    issn    = {1527-974X},
    doi     = {10.1093/jamia/ocae029},
    url     = {https://doi.org/10.1093/jamia/ocae029},
    eprint  = {https://academic.oup.com/jamia/advance-article-pdf/doi/10.1093/jamia/ocae029/56772025/ocae029.pdf},
}
```
| Property | Details |
|----------|---------|
| Pipeline Tag | sentence-similarity |
| Tags | sentence-transformers, feature-extraction, sentence-similarity, medical, biology |
| Language | English |
| License | other |
| License Name | ihtsdo-and-nlm-licences |
| License Link | https://www.nlm.nih.gov/databases/umls.html |
| Datasets | [FremyCompany/BioLORD-Dataset](https://huggingface.co/datasets/FremyCompany/BioLORD-Dataset), [FremyCompany/AGCT-Dataset](https://huggingface.co/datasets/FremyCompany/AGCT-Dataset) |