🚀 FremyCompany/BioLORD-STAMB2-v1
This model was trained using BioLORD, a new pre-training strategy for producing meaningful representations of clinical sentences and biomedical concepts. It establishes a new state of the art for text similarity on both clinical sentences (MedSTS) and biomedical concepts (MayoSRS).
⚠️ Important Note
This model was introduced in 2022, and newer versions have been released since. For most use cases, our latest generation of BioLORD models, BioLORD-2023, will be a better fit.
State-of-the-art methodologies operate by maximizing the similarity in representation of names referring to the same concept, and preventing collapse through contrastive learning. However, because biomedical names are not always self-explanatory, this sometimes results in non-semantic representations.

BioLORD overcomes this issue by grounding its concept representations using definitions, as well as short descriptions derived from a multi-relational knowledge graph consisting of biomedical ontologies. Thanks to this grounding, the model produces more semantic concept representations that match the hierarchical structure of ontologies more closely. BioLORD establishes a new state of the art for text similarity on both clinical sentences (MedSTS) and biomedical concepts (MayoSRS).
This model is based on sentence-transformers/all-mpnet-base-v2 and was further fine-tuned on the BioLORD-Dataset.
✨ Key Features
- This is a sentence-transformers model: it maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search (see the sketch after this list).
- The model was fine-tuned for the biomedical domain. While it keeps a good ability to produce embeddings for general-purpose text, it performs better on medical documents such as electronic health records (EHRs) or clinical notes.
- Sentences and phrases can be embedded in the same latent space.
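As a concrete illustration of the semantic-search use case mentioned above, here is a minimal sketch built on the `util` helpers that ship with sentence-transformers. The mini-corpus below is made up for illustration and the printed scores are indicative only:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('FremyCompany/BioLORD-STAMB2-v1')

# Hypothetical mini-corpus of concept names (illustrative, not from the model card)
corpus = ["Myocardial infarction", "Type 2 diabetes mellitus", "Bartonellosis"]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

# Embed a query and rank the corpus entries by cosine similarity
query_embedding = model.encode("Cat scratch disease", convert_to_tensor=True)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=3)[0]
for hit in hits:
    print(corpus[hit['corpus_id']], round(hit['score'], 3))
```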
📦 Installation
Using this model requires sentence-transformers:

```bash
pip install -U sentence-transformers
```
💻 Usage Examples
Basic Usage
```python
from sentence_transformers import SentenceTransformer

sentences = ["Cat scratch injury", "Cat scratch disease", "Bartonellosis"]

model = SentenceTransformer('FremyCompany/BioLORD-STAMB2-v1')
embeddings = model.encode(sentences)
print(embeddings)
```
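The three strings above denote closely related concepts, so their embeddings should land near each other. A quick follow-up check with `util.cos_sim` (exact values will vary):

```python
from sentence_transformers import util

# Pairwise cosine similarities; "Cat scratch disease" and "Bartonellosis"
# (two names for closely related conditions) should score high
print(util.cos_sim(embeddings, embeddings))
```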
Advanced Usage
Without sentence-transformers, you can use the model as follows: first pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.
```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Mean Pooling - take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Sentences we want sentence embeddings for
sentences = ["Cat scratch injury", "Cat scratch disease", "Bartonellosis"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('FremyCompany/BioLORD-STAMB2-v1')
model = AutoModel.from_pretrained('FremyCompany/BioLORD-STAMB2-v1')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

# Normalize embeddings
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)

print("Sentence embeddings:")
print(sentence_embeddings)
```
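Because the embeddings are L2-normalized at the end of the snippet above, cosine similarity reduces to a plain dot product; a one-line follow-up, for illustration:

```python
# Unit-norm embeddings: the matrix product yields pairwise cosine similarities
print(sentence_embeddings @ sentence_embeddings.T)
```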
📚 Documentation
Model Information
| Property | Details |
|---|---|
| Model type | Biomedical-domain model fine-tuned from sentence-transformers/all-mpnet-base-v2 |
| Training data | BioLORD-Dataset |
Citation
This model accompanies the paper BioLORD: Learning Ontological Representations from Definitions, which was accepted in the EMNLP 2022 Findings. When you use this model, please cite the original paper as follows:
```bibtex
@inproceedings{remy-etal-2022-biolord,
    title = "{B}io{LORD}: Learning Ontological Representations from Definitions for Biomedical Concepts and their Textual Descriptions",
    author = "Remy, François  and
      Demuynck, Kris  and
      Demeester, Thomas",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.findings-emnlp.104",
    pages = "1454--1465",
    abstract = "This work introduces BioLORD, a new pre-training strategy for producing meaningful representations for clinical sentences and biomedical concepts. State-of-the-art methodologies operate by maximizing the similarity in representation of names referring to the same concept, and preventing collapse through contrastive learning. However, because biomedical names are not always self-explanatory, it sometimes results in non-semantic representations. BioLORD overcomes this issue by grounding its concept representations using definitions, as well as short descriptions derived from a multi-relational knowledge graph consisting of biomedical ontologies. Thanks to this grounding, our model produces more semantic concept representations that match more closely the hierarchical structure of ontologies. BioLORD establishes a new state of the art for text similarity on both clinical sentences (MedSTS) and biomedical concepts (MayoSRS).",
}
```
You might also want to take a look at our MWE 2023 paper.
📄 License
My own contributions to this model are covered by the MIT license. However, because the data used to train this model originates from UMLS, you will need to ensure you have proper licensing of UMLS before using this model. UMLS is free of charge in most countries, but you might have to create an account and report on your usage of the data yearly to keep a valid license.