FremyCompany/BioLORD-2023-C
This model is trained with BioLORD, a novel pre-training strategy for generating meaningful representations of clinical sentences and biomedical concepts, achieving state-of-the-art performance in text similarity tasks.
Quick Start
This model is designed to map sentences and paragraphs to a 768-dimensional dense vector space, which can be used for tasks such as clustering or semantic search. It has been fine-tuned for the biomedical domain, making it particularly useful for processing medical documents such as EHR records or clinical notes.
Features
- Innovative Pre-training Strategy: BioLORD uses definitions and short descriptions from a multi-relational knowledge graph of biomedical ontologies to ground concept representations, producing more semantic concept representations that align with the hierarchical structure of ontologies.
- State-of-the-Art Performance: BioLORD-2023 sets a new benchmark for text similarity on both clinical sentences (MedSTS) and biomedical concepts (EHR-Rel-B).
- Multilingual and Sibling Models: It is accompanied by a series of sibling models, including multilingual and distilled versions, offering more options for different application scenarios.
Installation
If you want to use this model with sentence-transformers, you need to install the library first:

```bash
pip install -U sentence-transformers
```
Usage Examples
Basic Usage (Sentence-Transformers)
```python
from sentence_transformers import SentenceTransformer

sentences = ["Cat scratch injury", "Cat scratch disease", "Bartonellosis"]

model = SentenceTransformer('FremyCompany/BioLORD-2023-C')
embeddings = model.encode(sentences)
print(embeddings)
```
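For clustering or semantic search, the embeddings can be compared directly. Below is a minimal sketch using the `util.cos_sim` helper from sentence-transformers; it reuses the three example sentences above, and the exact similarity values will depend on the model weights.

```python
from sentence_transformers import SentenceTransformer, util

sentences = ["Cat scratch injury", "Cat scratch disease", "Bartonellosis"]

model = SentenceTransformer('FremyCompany/BioLORD-2023-C')
embeddings = model.encode(sentences, convert_to_tensor=True)

# Pairwise cosine similarities between all sentences (3 x 3 matrix)
similarities = util.cos_sim(embeddings, embeddings)
print(similarities)
```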
Advanced Usage (HuggingFace Transformers)
```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Mean pooling - take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Sentences we want sentence embeddings for
sentences = ["Cat scratch injury", "Cat scratch disease", "Bartonellosis"]

# Load model from the HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('FremyCompany/BioLORD-2023-C')
model = AutoModel.from_pretrained('FremyCompany/BioLORD-2023-C')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform mean pooling, then normalize the embeddings to unit length
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)

print("Sentence embeddings:")
print(sentence_embeddings)
```
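Because the embeddings are L2-normalized in the last step, cosine similarity reduces to a plain matrix product. A short continuation of the snippet above, assuming `sentence_embeddings` is the tensor computed there:

```python
# Cosine similarity matrix; rows and columns follow the order of `sentences`
similarity_matrix = sentence_embeddings @ sentence_embeddings.T
print(similarity_matrix)
```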
Documentation
Sibling Models
This model is part of the BioLORD-2023 series, and you may also want to explore the following sibling models:
- [BioLORD-2023-M](https://huggingface.co/FremyCompany/BioLORD-2023-M) (multilingual model; distilled from BioLORD-2023)
- [BioLORD-2023](https://huggingface.co/FremyCompany/BioLORD-2023) (best model after model averaging)
- [BioLORD-2023-S](https://huggingface.co/FremyCompany/BioLORD-2023-S) (best hyperparameters; no model averaging)
- [BioLORD-2023-C](https://huggingface.co/FremyCompany/BioLORD-2023-C) (contrastive training only; for NEL tasks; this model)

You can also refer to last year's model and paper:
- [BioLORD-2022](https://huggingface.co/FremyCompany/BioLORD-STAMB2-v1) (also known as BioLORD-STAMB2-v1)
Training Strategy
Summary of the 3 phases

Contrastive phase: details (see the illustrative sketch below)

Self-distillation phase: details
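Since this checkpoint corresponds to the contrastive phase only, here is a minimal, hypothetical sketch of what a contrastive training step can look like in sentence-transformers, pairing concept names with short descriptions and using in-batch negatives via `MultipleNegativesRankingLoss`. This is not the authors' exact training recipe; the example pairs, base model, loss choice, and hyperparameters below are assumptions made purely for illustration.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Hypothetical (concept name, description) pairs; the real training data comes
# from definitions and descriptions grounded in biomedical ontologies.
train_examples = [
    InputExample(texts=["Cat scratch disease",
                        "infectious disease caused by Bartonella henselae, typically following a cat scratch or bite"]),
    InputExample(texts=["Myocardial infarction",
                        "necrosis of heart muscle caused by an interruption of its blood supply"]),
]

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")  # placeholder base model

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
# In-batch negatives: each name is pulled toward its own description and
# pushed away from the other descriptions in the batch.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```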

Technical Details
State-of-the-art methodologies often result in non-semantic representations because biomedical names are not always self-explanatory. BioLORD overcomes this issue by grounding concept representations using definitions and short descriptions from a multi-relational knowledge graph of biomedical ontologies. This grounding enables the model to produce more semantic concept representations that match the hierarchical structure of ontologies.
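As a quick sanity check of this claim, you can compare a concept name against a broader related concept and an unrelated one; in a well-grounded space, the related pair should score noticeably higher. The concept choices below are illustrative only, and the exact scores depend on the model:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('FremyCompany/BioLORD-2023-C')

# "Bartonellosis" vs. a broader related concept and an unrelated concept
anchor, related, unrelated = "Bartonellosis", "Bacterial infectious disease", "Bone fracture"
emb = model.encode([anchor, related, unrelated], convert_to_tensor=True)

print("related:  ", util.cos_sim(emb[0], emb[1]).item())
print("unrelated:", util.cos_sim(emb[0], emb[2]).item())
```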
License
My own contributions for this model are covered by the MIT license. However, since the data used to train this model comes from UMLS and SnomedCT, you need to ensure you have proper licensing of UMLS and SnomedCT before using this model. Both UMLS and SnomedCT are free of charge in most countries, but you may need to create an account and report on your usage of the data yearly to maintain a valid license.
Citation
This model is associated with the BioLORD-2023: Learning Ontological Representations from Definitions paper. When using this model, please cite the original paper as follows:
```bibtex
@article{remy-etal-2023-biolord,
    author  = {Remy, François and Demuynck, Kris and Demeester, Thomas},
    title   = "{BioLORD-2023: semantic textual representations fusing large language models and clinical knowledge graph insights}",
    journal = {Journal of the American Medical Informatics Association},
    pages   = {ocae029},
    year    = {2024},
    month   = {02},
    issn    = {1527-974X},
    doi     = {10.1093/jamia/ocae029},
    url     = {https://doi.org/10.1093/jamia/ocae029},
    eprint  = {https://academic.oup.com/jamia/advance-article-pdf/doi/10.1093/jamia/ocae029/56772025/ocae029.pdf},
}
```
Important Note
If you are able to, please help me fund my open research. Thank you for your generosity!
| Property | Details |
|----------|---------|
| Pipeline Tag | sentence-similarity |
| Tags | sentence-transformers, feature-extraction, sentence-similarity, medical, biology |
| Language | en |
| License | other |
| License Name | ihtsdo-and-nlm-licences |
| License Link | https://www.nlm.nih.gov/databases/umls.html |
| Datasets | [FremyCompany/BioLORD-Dataset](https://huggingface.co/datasets/FremyCompany/BioLORD-Dataset), [FremyCompany/AGCT-Dataset](https://huggingface.co/datasets/FremyCompany/AGCT-Dataset) |