FremyCompany/BioLORD-2023-C
This model is trained with BioLORD, a novel pre-training strategy for generating meaningful representations of clinical sentences and biomedical concepts, achieving state-of-the-art performance in text similarity tasks.
Quick Start
This model is designed to map sentences and paragraphs to a 768-dimensional dense vector space, which can be used for tasks such as clustering or semantic search. It has been fine-tuned for the biomedical domain, making it particularly useful for processing medical documents such as EHR records or clinical notes.
Features
- Innovative Pre-training Strategy: BioLORD uses definitions and short descriptions from a multi-relational knowledge graph of biomedical ontologies to ground concept representations, producing more semantic concept representations that align with the hierarchical structure of ontologies.
- State-of-the-Art Performance: BioLORD-2023 sets a new benchmark for text similarity on both clinical sentences (MedSTS) and biomedical concepts (EHR-Rel-B).
- Multilingual and Sibling Models: It is accompanied by a series of sibling models, including multilingual and distilled versions, offering more options for different application scenarios.
Installation
If you want to use this model with sentence-transformers, you need to install the library first:

```bash
pip install -U sentence-transformers
```
Usage Examples
Basic Usage (Sentence-Transformers)
```python
from sentence_transformers import SentenceTransformer

sentences = ["Cat scratch injury", "Cat scratch disease", "Bartonellosis"]

model = SentenceTransformer('FremyCompany/BioLORD-2023-C')
embeddings = model.encode(sentences)
print(embeddings)
```
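For clustering or semantic search, the embeddings can be compared directly. Below is a minimal sketch using the `util.cos_sim` helper from sentence-transformers; it reuses the three example sentences above, and the exact similarity values will depend on the model weights.

```python
from sentence_transformers import SentenceTransformer, util

sentences = ["Cat scratch injury", "Cat scratch disease", "Bartonellosis"]

model = SentenceTransformer('FremyCompany/BioLORD-2023-C')
embeddings = model.encode(sentences, convert_to_tensor=True)

# Pairwise cosine similarities between all sentences (3 x 3 matrix)
similarities = util.cos_sim(embeddings, embeddings)
print(similarities)
```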
Advanced Usage (HuggingFace Transformers)
```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Mean pooling - take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Sentences we want sentence embeddings for
sentences = ["Cat scratch injury", "Cat scratch disease", "Bartonellosis"]

# Load model from the HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('FremyCompany/BioLORD-2023-C')
model = AutoModel.from_pretrained('FremyCompany/BioLORD-2023-C')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform mean pooling, then normalize the embeddings to unit length
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)

print("Sentence embeddings:")
print(sentence_embeddings)
```
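Because the embeddings are L2-normalized in the last step, cosine similarity reduces to a plain matrix product. A short continuation of the snippet above, assuming `sentence_embeddings` is the tensor computed there:

```python
# Cosine similarity matrix; rows and columns follow the order of `sentences`
similarity_matrix = sentence_embeddings @ sentence_embeddings.T
print(similarity_matrix)
```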
Documentation
Sibling Models
This model is part of the BioLORD-2023 series, and you may also want to explore the following sibling models:
- [BioLORD-2023-M](https://huggingface.co/FremyCompany/BioLORD-2023-M) (multilingual model; distilled from BioLORD-2023)
- [BioLORD-2023](https://huggingface.co/FremyCompany/BioLORD-2023) (best model after model averaging)
- [BioLORD-2023-S](https://huggingface.co/FremyCompany/BioLORD-2023-S) (best hyperparameters; no model averaging)
- [BioLORD-2023-C](https://huggingface.co/FremyCompany/BioLORD-2023-C) (contrastive training only; for NEL tasks; this model)

You can also refer to last year's model and paper:
- [BioLORD-2022](https://huggingface.co/FremyCompany/BioLORD-STAMB2-v1) (also known as BioLORD-STAMB2-v1)
Training Strategy
Summary of the 3 phases

Contrastive phase: details (see the illustrative sketch below)

Self-distillation phase: details
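Since this checkpoint corresponds to the contrastive phase only, here is a minimal, hypothetical sketch of what a contrastive training step can look like in sentence-transformers, pairing concept names with short descriptions and using in-batch negatives via `MultipleNegativesRankingLoss`. This is not the authors' exact training recipe; the example pairs, base model, loss choice, and hyperparameters below are assumptions made purely for illustration.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Hypothetical (concept name, description) pairs; the real training data comes
# from definitions and descriptions grounded in biomedical ontologies.
train_examples = [
    InputExample(texts=["Cat scratch disease",
                        "infectious disease caused by Bartonella henselae, typically following a cat scratch or bite"]),
    InputExample(texts=["Myocardial infarction",
                        "necrosis of heart muscle caused by an interruption of its blood supply"]),
]

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")  # placeholder base model

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
# In-batch negatives: each name is pulled toward its own description and
# pushed away from the other descriptions in the batch.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```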

Technical Details
State-of-the-art methodologies often result in non-semantic representations because biomedical names are not always self-explanatory. BioLORD overcomes this issue by grounding concept representations using definitions and short descriptions from a multi-relational knowledge graph of biomedical ontologies. This grounding enables the model to produce more semantic concept representations that match the hierarchical structure of ontologies.
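As a quick sanity check of this claim, you can compare a concept name against a broader related concept and an unrelated one; in a well-grounded space, the related pair should score noticeably higher. The concept choices below are illustrative only, and the exact scores depend on the model:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('FremyCompany/BioLORD-2023-C')

# "Bartonellosis" vs. a broader related concept and an unrelated concept
anchor, related, unrelated = "Bartonellosis", "Bacterial infectious disease", "Bone fracture"
emb = model.encode([anchor, related, unrelated], convert_to_tensor=True)

print("related:  ", util.cos_sim(emb[0], emb[1]).item())
print("unrelated:", util.cos_sim(emb[0], emb[2]).item())
```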
License
My own contributions for this model are covered by the MIT license. However, since the data used to train this model comes from UMLS and SnomedCT, you need to ensure you have proper licensing of UMLS and SnomedCT before using this model. Both UMLS and SnomedCT are free of charge in most countries, but you may need to create an account and report on your usage of the data yearly to maintain a valid license.
Citation
This model is associated with the BioLORD-2023: Learning Ontological Representations from Definitions paper. When using this model, please cite the original paper as follows:
```bibtex
@article{remy-etal-2023-biolord,
    author  = {Remy, François and Demuynck, Kris and Demeester, Thomas},
    title   = "{BioLORD-2023: semantic textual representations fusing large language models and clinical knowledge graph insights}",
    journal = {Journal of the American Medical Informatics Association},
    pages   = {ocae029},
    year    = {2024},
    month   = {02},
    issn    = {1527-974X},
    doi     = {10.1093/jamia/ocae029},
    url     = {https://doi.org/10.1093/jamia/ocae029},
    eprint  = {https://academic.oup.com/jamia/advance-article-pdf/doi/10.1093/jamia/ocae029/56772025/ocae029.pdf},
}
```
Important Note
If you are able to, please help me fund my open research. Thank you for your generosity!
| Property | Details |
|----------|---------|
| Pipeline Tag | sentence-similarity |
| Tags | sentence-transformers, feature-extraction, sentence-similarity, medical, biology |
| Language | en |
| License | other |
| License Name | ihtsdo-and-nlm-licences |
| License Link | https://www.nlm.nih.gov/databases/umls.html |
| Datasets | [FremyCompany/BioLORD-Dataset](https://huggingface.co/datasets/FremyCompany/BioLORD-Dataset), [FremyCompany/AGCT-Dataset](https://huggingface.co/datasets/FremyCompany/AGCT-Dataset) |