🚀 FremyCompany/BioLORD-STAMB2-v1
This model was trained using BioLORD, a new pre-training strategy for producing meaningful representations for clinical sentences and biomedical concepts. The model establishes a new state of the art for text similarity on both clinical sentences (MedSTS) and biomedical concepts (MayoSRS).
⚠️ Important Note
This model was introduced in 2022, and we have released newer versions since then. For most use cases, our latest generation of BioLORD models, BioLORD-2023, will be a better fit.
State-of-the-art methodologies operate by maximizing the similarity in representation of names referring to the same concept, and preventing collapse through contrastive learning. However, because biomedical names are not always self-explanatory, this sometimes results in non-semantic representations.
BioLORD overcomes this issue by grounding its concept representations using definitions, as well as short descriptions derived from a multi-relational knowledge graph consisting of biomedical ontologies. Thanks to this grounding, our model produces more semantic concept representations that match more closely the hierarchical structure of ontologies. BioLORD establishes a new state of the art for text similarity on both clinical sentences (MedSTS) and biomedical concepts (MayoSRS).
This model is based on sentence-transformers/all-mpnet-base-v2 and was further finetuned on the BioLORD-Dataset.
✨ Key Features
- This is a sentence-transformers model: it maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search (a semantic-search sketch follows this list).
- The model is finetuned for the biomedical domain and performs best on medical documents such as electronic health records or clinical notes, while remaining capable of producing embeddings for general-purpose text.
- Sentences and phrases can be embedded in the same latent space.
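To illustrate the semantic-search use case mentioned above, here is a minimal sketch; the query and candidate phrases are invented for this example, and the printed ranking is what one would expect rather than a verified output:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('FremyCompany/BioLORD-STAMB2-v1')

# Hypothetical query and candidate phrases, invented for this illustration
query = "infection transmitted by cat scratches"
candidates = ["Cat scratch disease", "Influenza", "Myocardial infarction"]

# Rank candidates by cosine similarity to the query in the shared latent space
query_emb = model.encode(query, convert_to_tensor=True)
cand_embs = model.encode(candidates, convert_to_tensor=True)
scores = util.cos_sim(query_emb, cand_embs)[0]
for candidate, score in sorted(zip(candidates, scores), key=lambda p: -float(p[1])):
    print(f"{float(score):.3f}  {candidate}")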
📦 Installation
To use this model, sentence-transformers must be installed:
pip install -U sentence-transformers
💻 Usage Examples
Basic Usage
from sentence_transformers import SentenceTransformer
sentences = ["Cat scratch injury", "Cat scratch disease", "Bartonellosis"]
model = SentenceTransformer('FremyCompany/BioLORD-STAMB2-v1')
embeddings = model.encode(sentences)
print(embeddings)
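As a follow-up, pairwise similarities between the example phrases can be computed with the util.cos_sim helper from sentence-transformers; since "Cat scratch disease" and "Bartonellosis" name the same condition, one would expect them to score as close neighbors:
from sentence_transformers import util

# Pairwise cosine similarities between the embeddings computed above
similarities = util.cos_sim(embeddings, embeddings)
print(similarities)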
Advanced Usage
Without sentence-transformers, you can use the model like this: first, pass your input through the Transformer model, then apply the right pooling operation on top of the contextualized word embeddings.
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F
# Mean Pooling - take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
# Sentences we want sentence embeddings for
sentences = ["Cat scratch injury", "Cat scratch disease", "Bartonellosis"]
# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('FremyCompany/BioLORD-STAMB2-v1')
model = AutoModel.from_pretrained('FremyCompany/BioLORD-STAMB2-v1')
# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
# Perform pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
# Normalize embeddings
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
print("Sentence embeddings:")
print(sentence_embeddings)
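Because the embeddings above are L2-normalized, cosine similarity between them reduces to a plain matrix product, so pairwise similarities can be obtained with:
# The embeddings are unit-length, so dot products equal cosine similarities
similarities = sentence_embeddings @ sentence_embeddings.T
print(similarities)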
📚 Documentation
Model Information
| Property | Details |
|---|---|
| Model type | Biomedical domain model finetuned from sentence-transformers/all-mpnet-base-v2 |
| Training data | BioLORD-Dataset |
Citation Information
This model accompanies the paper BioLORD: Learning Ontological Representations from Definitions, which has been accepted in the Findings of EMNLP 2022. When you use this model, please cite the original paper as follows:
@inproceedings{remy-etal-2022-biolord,
title = "{B}io{LORD}: Learning Ontological Representations from Definitions for Biomedical Concepts and their Textual Descriptions",
author = "Remy, François and
Demuynck, Kris and
Demeester, Thomas",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
month = dec,
year = "2022",
address = "Abu Dhabi, United Arab Emirates",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.findings-emnlp.104",
pages = "1454--1465",
abstract = "This work introduces BioLORD, a new pre-training strategy for producing meaningful representations for clinical sentences and biomedical concepts. State-of-the-art methodologies operate by maximizing the similarity in representation of names referring to the same concept, and preventing collapse through contrastive learning. However, because biomedical names are not always self-explanatory, it sometimes results in non-semantic representations. BioLORD overcomes this issue by grounding its concept representations using definitions, as well as short descriptions derived from a multi-relational knowledge graph consisting of biomedical ontologies. Thanks to this grounding, our model produces more semantic concept representations that match more closely the hierarchical structure of ontologies. BioLORD establishes a new state of the art for text similarity on both clinical sentences (MedSTS) and biomedical concepts (MayoSRS).",
}
You might also want to take a look at our MWE 2023 paper.
📄 License
My own contributions to this model are covered by the MIT license. However, because the data used to train this model originates from UMLS, you need to ensure you have proper licensing of UMLS before using this model. UMLS is free of charge in most countries, but you might have to create an account and report your usage of the data yearly to keep a valid license.