🚀 FremyCompany/BioLORD-2023-M
This model is based on BioLORD, a novel pre-training strategy to generate meaningful representations for clinical sentences and biomedical concepts, achieving state-of-the-art results in text similarity tasks.
🚀 Quick Start
This model is designed to map sentences and paragraphs to a 768-dimensional dense vector space, suitable for tasks like clustering or semantic search. It has been fine-tuned for the biomedical domain, making it particularly useful for processing medical documents such as EHRs or clinical notes.
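For instance, semantic search over a small corpus can be set up in a few lines. The sketch below is illustrative only: the clinical snippets are made-up examples, and it assumes the sentence-transformers package is installed (see Installation below):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('FremyCompany/BioLORD-2023-M')

# Toy corpus of clinical snippets (hypothetical examples)
corpus = [
    "Patient presents with fever and swollen lymph nodes after a cat scratch.",
    "Fracture of the left femur following a fall from a ladder.",
    "Type 2 diabetes mellitus, poorly controlled on metformin.",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

# Embed the query and retrieve the most similar corpus entry
query_embedding = model.encode("Bartonellosis", convert_to_tensor=True)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=1)
best = hits[0][0]
print(corpus[best['corpus_id']], best['score'])
```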
✨ Features
- Semantic Representation: Overcomes the limitations of traditional methods by grounding concept representations using definitions and short descriptions from a multi-relational knowledge graph of biomedical ontologies, producing more semantically meaningful concept representations.
- Multilingual Support: Officially supports 7 European languages (English, Spanish, French, German, Dutch, Danish, and Swedish), and many other languages unofficially.
- New State of the Art: Establishes a new state of the art for text similarity on both clinical sentences (MedSTS) and biomedical concepts (EHR-Rel-B).
📦 Installation
If you want to use this model with sentence-transformers, you need to install it first:
```bash
pip install -U sentence-transformers
```
💻 Usage Examples
Basic Usage
Using Sentence-Transformers
```python
from sentence_transformers import SentenceTransformer

# Sentences we want sentence embeddings for
sentences = ["Cat scratch injury", "Cat scratch disease", "Bartonellosis"]

# Load the model from the Hugging Face Hub
model = SentenceTransformer('FremyCompany/BioLORD-2023-M')

# Encode the sentences into 768-dimensional embeddings
embeddings = model.encode(sentences)
print(embeddings)
```
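To compare the encoded sentences directly, you can compute their pairwise cosine similarities. A minimal follow-up sketch using the util helpers bundled with sentence-transformers:

```python
from sentence_transformers import util

# Pairwise cosine similarities between the three embeddings;
# closely related concepts such as "Cat scratch disease" and
# "Bartonellosis" should score higher than unrelated pairs.
similarities = util.cos_sim(embeddings, embeddings)
print(similarities)
```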
Using HuggingFace Transformers
```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Mean pooling: average the token embeddings, using the attention
# mask so that padding tokens are ignored
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Sentences we want sentence embeddings for
sentences = ["Cat scratch injury", "Cat scratch disease", "Bartonellosis"]

# Load the model and tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained('FremyCompany/BioLORD-2023-M')
model = AutoModel.from_pretrained('FremyCompany/BioLORD-2023-M')

# Tokenize the sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling, then normalize the embeddings to unit length
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)

print("Sentence embeddings:")
print(sentence_embeddings)
```
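Because the embeddings are L2-normalized in the last step, the cosine similarity between any two sentences reduces to a plain dot product. A minimal follow-up sketch:

```python
# With unit-length embeddings, this matrix product equals the
# matrix of pairwise cosine similarities.
similarities = sentence_embeddings @ sentence_embeddings.T
print(similarities)
```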
📚 Documentation
Sibling models
This model is part of the BioLORD-2023 series; you might also be interested in its sibling models, as well as in last year's model and paper.
Training strategy
Summary of the 3 phases

Contrastive phase

Self-distillation phase
Citation
This model accompanies the BioLORD-2023: Semantic Textual Representations Fusing Large Language Models and Clinical Knowledge Graph Insights paper. When using this model, please cite the original paper as follows:
```bibtex
@article{remy-etal-2023-biolord,
    author  = {Remy, François and Demuynck, Kris and Demeester, Thomas},
    title   = "{BioLORD-2023: semantic textual representations fusing large language models and clinical knowledge graph insights}",
    journal = {Journal of the American Medical Informatics Association},
    pages   = {ocae029},
    year    = {2024},
    month   = {02},
    issn    = {1527-974X},
    doi     = {10.1093/jamia/ocae029},
    url     = {https://doi.org/10.1093/jamia/ocae029},
    eprint  = {https://academic.oup.com/jamia/advance-article-pdf/doi/10.1093/jamia/ocae029/56772025/ocae029.pdf},
}
```
📄 License
My own contributions to this model are covered by the MIT license. However, since the data used to train this model originates from UMLS and SnomedCT, you need to ensure you have proper licensing for UMLS and SnomedCT before using this model. Both UMLS and SnomedCT are free of charge in most countries, but you may have to create an account and report your usage of the data yearly to keep a valid license.
| Property | Details |
|----------|---------|
| Pipeline Tag | sentence-similarity |
| Tags | sentence-transformers, feature-extraction, sentence-similarity, medical, biology |
| Supported Languages | English, Spanish, French, German, Dutch, Danish, Swedish (officially); many other languages (unofficially) |
| Model Type | Based on [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2), fine-tuned on the [BioLORD-Dataset](https://huggingface.co/datasets/FremyCompany/BioLORD-Dataset) and LLM-generated definitions from the [Automatic Glossary of Clinical Terminology (AGCT)](https://huggingface.co/datasets/FremyCompany/AGCT-Dataset) |
| Training Data | [BioLORD-Dataset](https://huggingface.co/datasets/FremyCompany/BioLORD-Dataset), LLM-generated definitions from the [Automatic Glossary of Clinical Terminology (AGCT)](https://huggingface.co/datasets/FremyCompany/AGCT-Dataset) |
| License | MIT for my contributions; proper UMLS and SnomedCT licensing required |
| License Name | ihtsdo-and-nlm-licences |
| License Link | https://www.nlm.nih.gov/databases/umls.html |
⚠️ Important Note
If you are able to, please help me fund my open research. Thank you for your generosity!