🚀 FremyCompany/BioLORD-STAMB2-v1
This model was trained using BioLORD, a new pre-training strategy. It generates meaningful representations for clinical sentences and biomedical concepts, offering high-quality text-similarity performance in both domains.
🚀 Quick Start
This model was introduced in 2022, and a new version has been released since then. For most use cases, BioLORD-2023, our latest generation of BioLORD models, is recommended.
State-of-the-art methods maximize the representation similarity of names referring to the same concept and prevent collapse through contrastive learning. However, because biomedical names are not always self-explanatory, this can produce non-semantic representations. BioLORD addresses the issue by grounding concept representations with definitions and short descriptions drawn from a multi-relational knowledge graph of biomedical ontologies. The resulting concept representations are more semantic and match the hierarchical structure of the ontologies more closely. BioLORD sets a new state of the art for text similarity on both clinical sentences (MedSTS) and biomedical concepts (MayoSRS).
This model is based on sentence-transformers/all-mpnet-base-v2 and is further fine-tuned on the BioLORD-Dataset.
✨ Features
General Purpose
This is a sentence-transformers model. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search. Although finetuned for the biomedical domain, it retains the ability to embed general-purpose text, but it is best suited to processing medical documents such as electronic health records (EHRs) or clinical notes. Both sentences and phrases can be embedded in the same latent space.
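As a quick illustration of that shared latent space, here is a minimal semantic-search sketch using the util.semantic_search helper from sentence-transformers (the corpus sentences are made up for this example; installation instructions follow below):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('FremyCompany/BioLORD-STAMB2-v1')

# Hypothetical clinical sentences serving as a tiny corpus
corpus = [
    "Patient presents with fever and axillary lymphadenopathy after a cat scratch.",
    "Patient reports chronic lower back pain radiating to the left leg.",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

# A short concept name queried against full sentences in the same latent space
query_embedding = model.encode("Bartonellosis", convert_to_tensor=True)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)
print(hits)  # corpus entries ranked by cosine similarity to the query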
📦 Installation
Using this model is easy once you have sentence-transformers installed:
pip install -U sentence-transformers
💻 Usage Examples
Basic Usage (Sentence-Transformers)
from sentence_transformers import SentenceTransformer
sentences = ["Cat scratch injury", "Cat scratch disease", "Bartonellosis"]
model = SentenceTransformer('FremyCompany/BioLORD-STAMB2-v1')
embeddings = model.encode(sentences)
print(embeddings)
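To compare the resulting embeddings, you can compute pairwise cosine similarities with the util.cos_sim helper. A minimal, self-contained sketch (the expected ranking is illustrative, not a reported benchmark result):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('FremyCompany/BioLORD-STAMB2-v1')
embeddings = model.encode(["Cat scratch injury", "Cat scratch disease", "Bartonellosis"])

# 3x3 matrix of pairwise cosine similarities; "Cat scratch disease" should land
# closer to its synonym "Bartonellosis" than to the lexically similar "Cat scratch injury"
print(util.cos_sim(embeddings, embeddings))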
Advanced Usage (HuggingFace Transformers)
Without sentence-transformers, you can use the model as follows: first pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Mean pooling: take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Sentences we want sentence embeddings for
sentences = ["Cat scratch injury", "Cat scratch disease", "Bartonellosis"]

# Load the model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained('FremyCompany/BioLORD-STAMB2-v1')
model = AutoModel.from_pretrained('FremyCompany/BioLORD-STAMB2-v1')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling and normalize the embeddings
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)

print("Sentence embeddings:")
print(sentence_embeddings)
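Since the embeddings above are L2-normalized, cosine similarities reduce to a plain matrix product. A minimal follow-up, assuming the variables from the snippet above are still in scope:

# Pairwise cosine similarities (valid because sentence_embeddings is L2-normalized)
similarity_matrix = sentence_embeddings @ sentence_embeddings.T
print(similarity_matrix)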
📚 Documentation
Citation
This model accompanies the BioLORD: Learning Ontological Representations from Definitions paper, accepted at Findings of EMNLP 2022. When you use this model, please cite the original paper as follows:
@inproceedings{remy-etal-2022-biolord,
    title = "{B}io{LORD}: Learning Ontological Representations from Definitions for Biomedical Concepts and their Textual Descriptions",
    author = "Remy, François and
      Demuynck, Kris and
      Demeester, Thomas",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.findings-emnlp.104",
    pages = "1454--1465",
    abstract = "This work introduces BioLORD, a new pre-training strategy for producing meaningful representations for clinical sentences and biomedical concepts. State-of-the-art methodologies operate by maximizing the similarity in representation of names referring to the same concept, and preventing collapse through contrastive learning. However, because biomedical names are not always self-explanatory, it sometimes results in non-semantic representations. BioLORD overcomes this issue by grounding its concept representations using definitions, as well as short descriptions derived from a multi-relational knowledge graph consisting of biomedical ontologies. Thanks to this grounding, our model produces more semantic concept representations that match more closely the hierarchical structure of ontologies. BioLORD establishes a new state of the art for text similarity on both clinical sentences (MedSTS) and biomedical concepts (MayoSRS).",
}
You might also want to take a look at our MWE 2023 paper.
📄 License
My own contributions for this model are covered by the MIT license. However, given that the data used to train this model originates from UMLS, you will need to ensure you have proper licensing of UMLS before using this model. UMLS is free of charge in most countries, but you might have to create an account and report your usage of the data yearly to keep a valid license.
📄 Information Table
| Property | Details |
|----------|---------|
| Model Type | A sentence-transformers model fine-tuned for the biomedical domain |
| Training Data | BioLORD-Dataset |