🚀 FremyCompany/BioLORD-STAMB2-v1
This model was trained using BioLORD, a new pre-training strategy for producing meaningful representations of clinical sentences and biomedical concepts. It establishes a new state of the art for text similarity on both clinical sentences (MedSTS) and biomedical concepts (MayoSRS).
⚠️ Important Note
This model was introduced in 2022, and newer versions have been released since. For most use cases, our latest generation of BioLORD models, BioLORD-2023, will be a better fit.
State-of-the-art methodologies operate by maximizing the similarity in representation of names referring to the same concept, and preventing collapse through contrastive learning. However, because biomedical names are not always self-explanatory, this sometimes results in non-semantic representations.

BioLORD overcomes this issue by grounding its concept representations using definitions, as well as short descriptions derived from a multi-relational knowledge graph consisting of biomedical ontologies. Thanks to this grounding, the model produces more semantic concept representations that match the hierarchical structure of ontologies more closely. BioLORD establishes a new state of the art for text similarity on both clinical sentences (MedSTS) and biomedical concepts (MayoSRS).
This model is based on sentence-transformers/all-mpnet-base-v2 and was further fine-tuned on the BioLORD-Dataset.
✨ Key Features
- This is a sentence-transformers model: it maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search (see the sketch after this list).
- The model was fine-tuned for the biomedical domain. While it keeps a good ability to produce embeddings for general-purpose text, it performs better on medical documents such as electronic health records (EHRs) or clinical notes.
- Sentences and phrases can be embedded in the same latent space.
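As a concrete illustration of the semantic-search use case mentioned above, here is a minimal sketch built on the `util` helpers that ship with sentence-transformers. The mini-corpus below is made up for illustration and the printed scores are indicative only:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('FremyCompany/BioLORD-STAMB2-v1')

# Hypothetical mini-corpus of concept names (illustrative, not from the model card)
corpus = ["Myocardial infarction", "Type 2 diabetes mellitus", "Bartonellosis"]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

# Embed a query and rank the corpus entries by cosine similarity
query_embedding = model.encode("Cat scratch disease", convert_to_tensor=True)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=3)[0]
for hit in hits:
    print(corpus[hit['corpus_id']], round(hit['score'], 3))
```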
📦 Installation
Using this model requires sentence-transformers:

```bash
pip install -U sentence-transformers
```
💻 Usage Examples
Basic Usage
```python
from sentence_transformers import SentenceTransformer

sentences = ["Cat scratch injury", "Cat scratch disease", "Bartonellosis"]

model = SentenceTransformer('FremyCompany/BioLORD-STAMB2-v1')
embeddings = model.encode(sentences)
print(embeddings)
```
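The three strings above denote closely related concepts, so their embeddings should land near each other. A quick follow-up check with `util.cos_sim` (exact values will vary):

```python
from sentence_transformers import util

# Pairwise cosine similarities; "Cat scratch disease" and "Bartonellosis"
# (two names for closely related conditions) should score high
print(util.cos_sim(embeddings, embeddings))
```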
Advanced Usage
Without sentence-transformers, you can use the model as follows: first pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.
```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Mean Pooling - take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Sentences we want sentence embeddings for
sentences = ["Cat scratch injury", "Cat scratch disease", "Bartonellosis"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('FremyCompany/BioLORD-STAMB2-v1')
model = AutoModel.from_pretrained('FremyCompany/BioLORD-STAMB2-v1')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

# Normalize embeddings
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)

print("Sentence embeddings:")
print(sentence_embeddings)
```
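Because the embeddings are L2-normalized at the end of the snippet above, cosine similarity reduces to a plain dot product; a one-line follow-up, for illustration:

```python
# Unit-norm embeddings: the matrix product yields pairwise cosine similarities
print(sentence_embeddings @ sentence_embeddings.T)
```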
📚 Documentation
Model Information
| Property | Details |
|---|---|
| Model type | Biomedical-domain model fine-tuned from sentence-transformers/all-mpnet-base-v2 |
| Training data | BioLORD-Dataset |
Citation
This model accompanies the paper BioLORD: Learning Ontological Representations from Definitions, which was accepted in the EMNLP 2022 Findings. When you use this model, please cite the original paper as follows:
```bibtex
@inproceedings{remy-etal-2022-biolord,
    title = "{B}io{LORD}: Learning Ontological Representations from Definitions for Biomedical Concepts and their Textual Descriptions",
    author = "Remy, François  and
      Demuynck, Kris  and
      Demeester, Thomas",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.findings-emnlp.104",
    pages = "1454--1465",
    abstract = "This work introduces BioLORD, a new pre-training strategy for producing meaningful representations for clinical sentences and biomedical concepts. State-of-the-art methodologies operate by maximizing the similarity in representation of names referring to the same concept, and preventing collapse through contrastive learning. However, because biomedical names are not always self-explanatory, it sometimes results in non-semantic representations. BioLORD overcomes this issue by grounding its concept representations using definitions, as well as short descriptions derived from a multi-relational knowledge graph consisting of biomedical ontologies. Thanks to this grounding, our model produces more semantic concept representations that match more closely the hierarchical structure of ontologies. BioLORD establishes a new state of the art for text similarity on both clinical sentences (MedSTS) and biomedical concepts (MayoSRS).",
}
```
You might also want to take a look at our MWE 2023 paper.
📄 License
My own contributions to this model are covered by the MIT license. However, because the data used to train this model originates from UMLS, you will need to ensure you have proper licensing of UMLS before using this model. UMLS is free of charge in most countries, but you might have to create an account and report on your usage of the data yearly to keep a valid license.