🚀 FremyCompany/BioLORD-2023-M
This model is based on BioLORD, a novel pre-training strategy to generate meaningful representations for clinical sentences and biomedical concepts, achieving state-of-the-art results in text similarity tasks.
🚀 Quick Start
This model is designed to map sentences and paragraphs to a 768-dimensional dense vector space, suitable for tasks like clustering or semantic search. It has been fine-tuned for the biomedical domain, making it particularly useful for processing medical documents such as EHRs or clinical notes.
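For instance, semantic search over a small corpus can be set up in a few lines. The sketch below is illustrative only: the clinical snippets are made-up examples, and it assumes the sentence-transformers package is installed (see Installation below):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('FremyCompany/BioLORD-2023-M')

# Toy corpus of clinical snippets (hypothetical examples)
corpus = [
    "Patient presents with fever and swollen lymph nodes after a cat scratch.",
    "Fracture of the left femur following a fall from a ladder.",
    "Type 2 diabetes mellitus, poorly controlled on metformin.",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

# Embed the query and retrieve the most similar corpus entry
query_embedding = model.encode("Bartonellosis", convert_to_tensor=True)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=1)
best = hits[0][0]
print(corpus[best['corpus_id']], best['score'])
```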
✨ Features
- Semantic Representation: Overcomes the limitations of traditional methods by grounding concept representations using definitions and short descriptions from a multi-relational knowledge graph of biomedical ontologies, producing more semantically meaningful concept representations.
- Multilingual Support: Officially supports 7 European languages (English, Spanish, French, German, Dutch, Danish, and Swedish), and many other languages unofficially.
- New State of the Art: Establishes a new state of the art for text similarity on both clinical sentences (MedSTS) and biomedical concepts (EHR-Rel-B).
📦 Installation
If you want to use this model with sentence-transformers, you need to install it first:
```bash
pip install -U sentence-transformers
```
💻 Usage Examples
Basic Usage
Using Sentence-Transformers
```python
from sentence_transformers import SentenceTransformer

# Sentences we want sentence embeddings for
sentences = ["Cat scratch injury", "Cat scratch disease", "Bartonellosis"]

# Load the model from the Hugging Face Hub
model = SentenceTransformer('FremyCompany/BioLORD-2023-M')

# Encode the sentences into 768-dimensional embeddings
embeddings = model.encode(sentences)
print(embeddings)
```
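To compare the encoded sentences directly, you can compute their pairwise cosine similarities. A minimal follow-up sketch using the util helpers bundled with sentence-transformers:

```python
from sentence_transformers import util

# Pairwise cosine similarities between the three embeddings;
# closely related concepts such as "Cat scratch disease" and
# "Bartonellosis" should score higher than unrelated pairs.
similarities = util.cos_sim(embeddings, embeddings)
print(similarities)
```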
Using HuggingFace Transformers
```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Mean pooling: average the token embeddings, using the attention
# mask so that padding tokens are ignored
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Sentences we want sentence embeddings for
sentences = ["Cat scratch injury", "Cat scratch disease", "Bartonellosis"]

# Load the model and tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained('FremyCompany/BioLORD-2023-M')
model = AutoModel.from_pretrained('FremyCompany/BioLORD-2023-M')

# Tokenize the sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling, then normalize the embeddings to unit length
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)

print("Sentence embeddings:")
print(sentence_embeddings)
```
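Because the embeddings are L2-normalized in the last step, the cosine similarity between any two sentences reduces to a plain dot product. A minimal follow-up sketch:

```python
# With unit-length embeddings, this matrix product equals the
# matrix of pairwise cosine similarities.
similarities = sentence_embeddings @ sentence_embeddings.T
print(similarities)
```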
📚 Documentation
Sibling models
This model is part of the BioLORD-2023 series; you might also be interested in its sibling models, as well as in last year's model and paper.
Training strategy
Summary of the 3 phases

Contrastive phase

Self-distillation phase
Citation
This model accompanies the BioLORD-2023: Semantic Textual Representations Fusing Large Language Models and Clinical Knowledge Graph Insights paper. When using this model, please cite the original paper as follows:
```bibtex
@article{remy-etal-2023-biolord,
    author  = {Remy, François and Demuynck, Kris and Demeester, Thomas},
    title   = "{BioLORD-2023: semantic textual representations fusing large language models and clinical knowledge graph insights}",
    journal = {Journal of the American Medical Informatics Association},
    pages   = {ocae029},
    year    = {2024},
    month   = {02},
    issn    = {1527-974X},
    doi     = {10.1093/jamia/ocae029},
    url     = {https://doi.org/10.1093/jamia/ocae029},
    eprint  = {https://academic.oup.com/jamia/advance-article-pdf/doi/10.1093/jamia/ocae029/56772025/ocae029.pdf},
}
```
📄 License
My own contributions to this model are covered by the MIT license. However, since the data used to train this model originates from UMLS and SnomedCT, you need to ensure you have proper licensing for UMLS and SnomedCT before using this model. Both UMLS and SnomedCT are free of charge in most countries, but you may have to create an account and report your usage of the data yearly to keep a valid license.
| Property | Details |
|----------|---------|
| Pipeline Tag | sentence-similarity |
| Tags | sentence-transformers, feature-extraction, sentence-similarity, medical, biology |
| Supported Languages | English, Spanish, French, German, Dutch, Danish, Swedish (officially); many other languages (unofficially) |
| Model Type | Based on [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2), fine-tuned on the [BioLORD-Dataset](https://huggingface.co/datasets/FremyCompany/BioLORD-Dataset) and LLM-generated definitions from the [Automatic Glossary of Clinical Terminology (AGCT)](https://huggingface.co/datasets/FremyCompany/AGCT-Dataset) |
| Training Data | [BioLORD-Dataset](https://huggingface.co/datasets/FremyCompany/BioLORD-Dataset), LLM-generated definitions from the [Automatic Glossary of Clinical Terminology (AGCT)](https://huggingface.co/datasets/FremyCompany/AGCT-Dataset) |
| License | MIT for my contributions; proper UMLS and SnomedCT licensing required |
| License Name | ihtsdo-and-nlm-licences |
| License Link | https://www.nlm.nih.gov/databases/umls.html |
⚠️ Important Note
If you are able to, please help me fund my open research. Thank you for your generosity!