FremyCompany/BioLORD-2023
This model, trained with BioLORD, offers meaningful representations for clinical sentences and biomedical concepts, achieving state-of-the-art results in text similarity.
Quick Start
This model is designed to map sentences and paragraphs to a 768-dimensional dense vector space, suitable for tasks such as clustering or semantic search, especially in the biomedical domain.
Features
- Innovative Training Strategy: BioLORD uses a new pre-training strategy that grounds concept representations with definitions and short descriptions derived from a multi-relational knowledge graph of biomedical ontologies. This approach produces more semantic concept representations that match the hierarchical structure of the ontologies.
- Domain-Specific Fine-Tuning: Fine-tuned on biomedical datasets, the model is well suited for processing medical documents such as EHR records and clinical notes.
- Multiple Model Variants: Part of the BioLORD-2023 series, which includes multilingual, distilled, and contrastive-trained models.
Installation
If you want to use this model with sentence-transformers, you need to install it first:

```bash
pip install -U sentence-transformers
```
Usage Examples
Basic Usage (Sentence-Transformers)
```python
from sentence_transformers import SentenceTransformer

sentences = ["Cat scratch injury", "Cat scratch disease", "Bartonellosis"]

# Load the model and compute one 768-dimensional embedding per sentence.
model = SentenceTransformer('FremyCompany/BioLORD-2023')
embeddings = model.encode(sentences)
print(embeddings)
```
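The embeddings can be compared directly for semantic textual similarity or semantic search. The following sketch uses the `util.cos_sim` helper from sentence-transformers to rank the example phrases against each other; it is an illustrative addition to the basic example, and the exact scores depend on the checkpoint you load.

```python
from sentence_transformers import SentenceTransformer, util

# Example phrases: two names for the same disease plus a related injury.
sentences = ["Cat scratch injury", "Cat scratch disease", "Bartonellosis"]

model = SentenceTransformer('FremyCompany/BioLORD-2023')
embeddings = model.encode(sentences, convert_to_tensor=True)

# Pairwise cosine similarities; higher values indicate closer biomedical meaning.
similarities = util.cos_sim(embeddings, embeddings)
print(similarities)
```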
Advanced Usage (HuggingFace Transformers)
```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Mean pooling: average the token embeddings, taking the attention mask
# into account so that padding tokens are ignored.
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Sentences we want embeddings for
sentences = ["Cat scratch injury", "Cat scratch disease", "Bartonellosis"]

# Load the model and tokenizer from the HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('FremyCompany/BioLORD-2023')
model = AutoModel.from_pretrained('FremyCompany/BioLORD-2023')

# Tokenize the sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling and normalize the embeddings to unit length
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)

print("Sentence embeddings:")
print(sentence_embeddings)
```
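Since the pooled embeddings are normalized to unit length in the snippet above, cosine similarities between the three sentences can be obtained with a plain matrix product. This short continuation is an illustrative addition, not part of the original example:

```python
# Continuing from the snippet above: for unit-length embeddings,
# the dot product equals the cosine similarity.
similarity_matrix = sentence_embeddings @ sentence_embeddings.T
print(similarity_matrix)
```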
Documentation
Sibling models
This model is part of the BioLORD-2023 series. You might also be interested in the following models:
- [BioLORD-2023-M](https://huggingface.co/FremyCompany/BioLORD-2023-M) (multilingual model; distilled from BioLORD-2023)
- [BioLORD-2023](https://huggingface.co/FremyCompany/BioLORD-2023) (best model after model averaging; this model)
- [BioLORD-2023-S](https://huggingface.co/FremyCompany/BioLORD-2023-S) (best hyperparameters; no model averaging)
- [BioLORD-2023-C](https://huggingface.co/FremyCompany/BioLORD-2023-C) (contrastive training only; for NEL tasks; see the retrieval sketch below)
You can also refer to last year's model and paper:
- [BioLORD-2022](https://huggingface.co/FremyCompany/BioLORD-STAMB2-v1) (also known as BioLORD-STAMB2-v1)
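As an illustration of the NEL (named entity linking) use case mentioned for BioLORD-2023-C above, the hypothetical sketch below links a free-text mention to the closest concept name from a small candidate list by nearest-neighbor search over embeddings; the mention and candidate list are invented for the example, and a real setup would draw candidates from an ontology such as SNOMED CT.

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical candidate concept names and mention, invented for this example.
candidates = ["Cat scratch disease", "Cat scratch injury", "Lyme disease"]
mention = "bartonella henselae infection after a cat scratch"

model = SentenceTransformer('FremyCompany/BioLORD-2023-C')
candidate_embeddings = model.encode(candidates, convert_to_tensor=True)
mention_embedding = model.encode(mention, convert_to_tensor=True)

# Pick the candidate whose embedding is closest to the mention.
scores = util.cos_sim(mention_embedding, candidate_embeddings)[0]
print(candidates[int(scores.argmax())])
```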
Training strategy
Summary of the 3 phases
Contrastive phase: details
Self-distillation phase: details
Technical Details
State-of-the-art methodologies often maximize the similarity between representations of names referring to the same concept, and prevent collapse through contrastive learning. However, because biomedical names are not always self-explanatory, this sometimes results in non-semantic representations. BioLORD overcomes this issue by grounding its concept representations with definitions and short descriptions derived from a multi-relational knowledge graph of biomedical ontologies, which leads to more semantic concept representations that match the hierarchical structure of the ontologies.
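To make the general idea of contrastive grounding concrete, the sketch below implements a plain InfoNCE-style objective that pulls a concept name's embedding towards the embedding of its definition while pushing it away from the other definitions in the batch. This is an illustrative approximation of this family of methods, not the BioLORD training code; the batch construction and temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(name_embeddings, definition_embeddings, temperature=0.05):
    """InfoNCE-style loss: the i-th name should match the i-th definition.

    Both inputs have shape (batch, dim), e.g. pooled sentence-encoder outputs.
    The temperature of 0.05 is an assumed value, not the one used for BioLORD.
    """
    # Normalize so that dot products are cosine similarities.
    names = F.normalize(name_embeddings, p=2, dim=1)
    defs = F.normalize(definition_embeddings, p=2, dim=1)

    # (batch, batch) similarity matrix; diagonal entries are the positive pairs.
    logits = names @ defs.T / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: name -> definition and definition -> name.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
```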
License
The author's own contributions for this model are covered by the MIT license. However, since the training data comes from UMLS and SnomedCT, you need to ensure you have proper licensing of UMLS and SnomedCT before using this model. Both UMLS and SnomedCT are free of charge in most countries, but you may need to create an account and report on your usage of the data yearly to maintain a valid license.
Citation
This model accompanies the BioLORD-2023: Learning Ontological Representations from Definitions paper. When using this model, please cite the original paper as follows:
```bibtex
@article{remy-etal-2023-biolord,
    author  = {Remy, François and Demuynck, Kris and Demeester, Thomas},
    title   = "{BioLORD-2023: semantic textual representations fusing large language models and clinical knowledge graph insights}",
    journal = {Journal of the American Medical Informatics Association},
    pages   = {ocae029},
    year    = {2024},
    month   = {02},
    issn    = {1527-974X},
    doi     = {10.1093/jamia/ocae029},
    url     = {https://doi.org/10.1093/jamia/ocae029},
    eprint  = {https://academic.oup.com/jamia/advance-article-pdf/doi/10.1093/jamia/ocae029/56772025/ocae029.pdf},
}
```
| Property | Details |
|----------|---------|
| Pipeline Tag | sentence-similarity |
| Tags | sentence-transformers, feature-extraction, sentence-similarity, medical, biology |
| Language | English |
| License | other |
| License Name | ihtsdo-and-nlm-licences |
| License Link | https://www.nlm.nih.gov/databases/umls.html |
| Datasets | [FremyCompany/BioLORD-Dataset](https://huggingface.co/datasets/FremyCompany/BioLORD-Dataset), [FremyCompany/AGCT-Dataset](https://huggingface.co/datasets/FremyCompany/AGCT-Dataset) |