BioLORD-2023-C open-source model - generate meaningful representations of biomedical and clinical text for free

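Developed by FremyCompany

BioLORD-2023-C is a sentence-transformers model trained with the BioLORD strategy. It focuses on producing meaningful representations of biomedical and clinical text.

Text embedding | English | License: other | #biomedical semantic similarity #clinical concept embedding #ontology knowledge enhancement

Downloads: 188.08k

Release date: 2/12/2024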

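Model overview

The model grounds its concept representations in definitions and in short descriptions extracted from a knowledge graph of biomedical ontologies. The resulting semantic concept representations align more closely with the hierarchical structure of those ontologies. The model is suited to text-similarity tasks on clinical sentences and biomedical concepts.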

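Model features

Semantic concept representations

Grounding concept representations in definitions and knowledge-graph descriptions yields semantic representations that better match the ontology hierarchy.

Multi-phase training

Training follows a three-phase strategy that includes a contrastive learning phase and a self-distillation phase.

Biomedical optimization

The model is optimized specifically for the biomedical and clinical domains, so it works better on medical documents such as electronic health records and clinical notes.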

Model capabilities

Sentence similarity computation

Biomedical text feature extraction

Clinical text embedding generation

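Use cases

Medical information processing

Clinical note analysis

Analyze clinical notes in electronic health records and extract key information.

The model produces meaningful text representations that simplify downstream analysis and processing.

Biomedical concept matching

Match biomedical concepts that are expressed differently, such as 'cat scratch disease' and 'bartonellosis'.

The model identifies semantically similar concepts accurately.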
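
The concept-matching use case can be sketched as a nearest-neighbour lookup over embeddings. The example below is a minimal sketch under stated assumptions, not part of the model's API: `best_match` and `cosine` are hypothetical helpers, and the three-dimensional vectors are made-up stand-ins for the 768-dimensional embeddings the model would produce.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def best_match(query_vec, concept_vecs):
    """Return (label, score) for the concept whose vector is closest to query_vec.

    concept_vecs maps a concept label to its embedding vector.
    """
    scored = ((label, cosine(query_vec, vec)) for label, vec in concept_vecs.items())
    return max(scored, key=lambda pair: pair[1])


# Toy vectors, assumed for illustration only; real embeddings come from the model.
concepts = {
    "Bartonellosis": [0.9, 0.1, 0.0],
    "Fracture of femur": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # stand-in for the embedding of "Cat scratch disease"

label, score = best_match(query, concepts)
print(label, round(score, 3))
```

With real embeddings, the dictionary would hold one vector per concept name in the target vocabulary, and the query would be the embedding of the mention to link.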

🚀 FremyCompany/BioLORD-2023-C

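This model addresses the problem of producing meaningful representations for clinical sentences and biomedical concepts. It was trained with BioLORD, a new pre-training strategy, and establishes a new state of the art on text-similarity tasks for both clinical sentences and biomedical concepts.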

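🚀 Quick start

This is a sentence-transformers model: it maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks such as clustering or semantic search. It has been fine-tuned for the biomedical domain, which makes it more useful for medical documents such as electronic health records or clinical notes.

Install dependencies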

pip install -U sentence-transformers

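Code example

from sentence_transformers import SentenceTransformer

# Example inputs: three related biomedical terms
sentences = ["Cat scratch injury", "Cat scratch disease", "Bartonellosis"]

# Load the model from the Hugging Face Hub and encode the sentences
model = SentenceTransformer('FremyCompany/BioLORD-2023-C')
embeddings = model.encode(sentences)  # one 768-dimensional vector per sentence
print(embeddings)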
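
A common next step after encoding is to compare the embeddings with each other. The sketch below computes pairwise cosine similarities in plain Python on made-up three-dimensional vectors, so it runs without downloading the model. The names `cosine_similarity_matrix` and `toy_embeddings` are hypothetical and introduced here only for illustration; with the real model, `embeddings.tolist()` could be passed instead.

```python
import math


def cosine_similarity_matrix(vectors):
    """Return the square matrix of pairwise cosine similarities."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    return [
        [
            sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)
            for v, norm_v in zip(vectors, norms)
        ]
        for u, norm_u in zip(vectors, norms)
    ]


# Toy stand-ins for sentence embeddings (assumed values, not model output).
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 0.0, 1.0],
]

similarities = cosine_similarity_matrix(toy_embeddings)
for row in similarities:
    print([round(s, 2) for s in row])
```

The sentence-transformers library also provides `util.cos_sim`, which is likely the more convenient choice in practice.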

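✨ Key features

Novel pre-training strategy: the BioLORD pre-training strategy builds concept representations from definitions and from short descriptions derived from a multi-relational knowledge graph. This overcomes the tendency of conventional methods to produce non-semantic representations.
Strong semantic alignment: the resulting concept representations are more semantic and match the hierarchical structure of ontologies more closely.
Domain-specific: the model is fine-tuned for the biomedical domain and performs better on medical documents.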

💻 Usage examples

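Advanced usage

Without the sentence-transformers library, you can load the model with Hugging Face transformers and apply mean pooling to the token embeddings yourself: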

from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Mean pooling: take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Sentences we want sentence embeddings for
sentences = ["Cat scratch injury", "Cat scratch disease", "Bartonellosis"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('FremyCompany/BioLORD-2023-C')
model = AutoModel.from_pretrained('FremyCompany/BioLORD-2023-C')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
# Perform pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
# Normalize embeddings
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
print("Sentence embeddings:")
print(sentence_embeddings)
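
To make the pooling and normalization steps concrete, here is a small plain-Python sketch of the same arithmetic for a single sentence. It is an illustration under stated assumptions rather than a replacement for the PyTorch code: `masked_mean_pool` and `l2_normalize` are hypothetical helpers, and the two-dimensional token vectors are made up.

```python
import math


def masked_mean_pool(token_embeddings, attention_mask):
    """Average the token vectors of one sentence, skipping padding (mask == 0)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            for i, value in enumerate(vec):
                total[i] += value
    # Lower bound on the token count, mirroring torch.clamp(..., min=1e-9).
    count = max(sum(attention_mask), 1e-9)
    return [t / count for t in total]


def l2_normalize(vec, eps=1e-12):
    """Scale vec to unit Euclidean length (eps guards against division by zero)."""
    norm = max(math.sqrt(sum(x * x for x in vec)), eps)
    return [x / norm for x in vec]


# Two real tokens followed by one padding token (assumed toy values).
tokens = [[3.0, 0.0], [1.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean_pool(tokens, mask)
unit = l2_normalize(pooled)
print(pooled)
print(unit)
```

The padding row is deliberately large: because its mask value is 0, it has no effect on the pooled vector, which is the reason the attention mask is passed to the pooling function.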

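📚 Detailed documentation

Model background

State-of-the-art methods work by maximizing the similarity between the representations of names that refer to the same concept, while using contrastive learning to prevent collapse. Because biomedical names are not always self-explanatory, this sometimes results in non-semantic representations. BioLORD overcomes this issue by grounding its concept representations in definitions and in short descriptions derived from a multi-relational knowledge graph made up of biomedical ontologies.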
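
The contrastive idea described above can be illustrated with a generic InfoNCE-style loss. This is a sketch of the general technique only; the exact objective used to train BioLORD is not given here, and the function name `info_nce_loss`, the temperature value, and the toy vectors are all assumptions made for illustration.

```python
import math


def info_nce_loss(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss where positives[i] is the match for anchors[i].

    Every other row of `positives` acts as an in-batch negative.
    Vectors are assumed to be L2-normalized, so the dot product is the cosine.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [
            sum(a * p for a, p in zip(anchor, positive)) / temperature
            for positive in positives
        ]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Negative log-probability of picking the correct partner.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy unit vectors: each "name" paired with a correct or an incorrect partner.
names = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = info_nce_loss(names, aligned)
loss_swapped = info_nce_loss(names, swapped)
print(loss_aligned, loss_swapped)
```

The loss is small when each anchor is closest to its own partner and large when the pairs are mismatched. The in-batch negatives are what discourage the degenerate solution in which every input maps to the same vector.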

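Training strategy

Overview of the three phases

(Figure not included.)

Details of the contrastive phase

(Figure not included.)

Details of the self-distillation phase

(Figure not included.)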

Citation

This model accompanies the paper BioLORD-2023: Learning Ontological Representations from Definitions. When you use this model, please cite the original paper as follows:

@article{remy-etal-2023-biolord,
    author = {Remy, François and Demuynck, Kris and Demeester, Thomas},
    title = "{BioLORD-2023: semantic textual representations fusing large language models and clinical knowledge graph insights}",
    journal = {Journal of the American Medical Informatics Association},
    pages = {ocae029},
    year = {2024},
    month = {02},
    issn = {1527-974X},
    doi = {10.1093/jamia/ocae029},
    url = {https://doi.org/10.1093/jamia/ocae029},
    eprint = {https://academic.oup.com/jamia/advance-article-pdf/doi/10.1093/jamia/ocae029/56772025/ocae029.pdf},
}