Open-source BiomedVLP-BioViL-T model: free analysis of chest X-rays and radiology reports

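Developed by Microsoft

BioViL-T is a vision-language model for analyzing chest X-rays and radiology reports; temporal multi-modal pre-training improves its performance.

Multimodal fusion

Transformers

Language: English · License: MIT · #chest-X-ray-analysis #temporal-multimodal-pretraining #radiology-report-generation

Downloads: 26.39k

Release date: 2/17/2023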

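Model overview

BioViL-T is a domain-specific vision-language model for analyzing chest X-rays (CXRs) and radiology reports. It is trained with a temporal multi-modal pre-training procedure that embeds temporal information in the image modality, the text modality, and the joint space, which improves performance on several downstream tasks.

Model features

Temporal multi-modal pre-training

Exploits the temporal structure between data points to improve downstream performance while using the same training dataset as its predecessor, BioViL.

Cross-modal alignment

Aligns text and image embeddings using the latent representation of the [CLS] token, which supports cross-modal understanding.

Domain-specific optimization

Optimized specifically for chest X-rays and radiology reports, and performs well on tasks in that domain.

Two-stage training

The language model is first pre-trained on the general biomedical domain and then trained on the radiology domain to specialize it.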

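Model capabilities

Chest X-ray analysis

Radiology report understanding

Natural language inference

Phrase grounding

Image classification

Text classification

Language decoding

Cross-modal retrieval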

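Use cases

Medical image analysis

Chest X-ray abnormality detection

Analyze chest X-rays and detect abnormalities such as pleural effusion or pneumothorax.

Reported result: 87.77% accuracy on the MS-CXR-T benchmark. Note that this figure measures how well the text embeddings capture temporal content (see Performance); it is not an image-level detection accuracy.

Radiology report generation

Generate or complete radiology reports from chest X-rays.

Medical research

Medical imaging and language processing research

Supports AI researchers exploring research questions in clinical NLP and vision-language processing (VLP).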

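🚀 BioViL-T

BioViL-T is a domain-specific vision-language model designed to analyze chest X-rays (CXRs) and radiology reports. It is trained with a temporal multi-modal pre-training procedure, which distinguishes it from its predecessor model (BioViL). Specifically, BioViL-T exploits the temporal structure between data points, which improves downstream performance on multiple benchmarks while using the same training dataset as its predecessor. In particular, the model shows significant improvement in embedding the temporal information present in the image and text modalities (see the results), as well as in the joint space. The canonical model can be adapted to both single-image and multi-image downstream applications, including natural language inference, phrase grounding, image/text classification, and language decoding.

🚀 Quick start

Model usage example

The following shows how to use this model to extract radiology sentence embeddings and obtain their cosine similarity in the joint space (image and text):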

import torch
from transformers import AutoModel, AutoTokenizer

# Load the model and tokenizer
url = "microsoft/BiomedVLP-BioViL-T"
tokenizer = AutoTokenizer.from_pretrained(url, trust_remote_code=True)
model = AutoModel.from_pretrained(url, trust_remote_code=True)

# Input text prompts describing findings.
# The order of prompts is adjusted to capture the spectrum from absence of a finding to its temporal progression.
text_prompts = ["No pleural effusion or pneumothorax is seen.",
                "There is no pneumothorax or pleural effusion.",
                "The extent of the pleural effusion is reduced.",
                "The extent of the pleural effusion remains constant.",
                "Interval enlargement of pleural effusion."]

# Tokenize and compute the sentence embeddings
with torch.no_grad():
    tokenizer_output = tokenizer.batch_encode_plus(batch_text_or_text_pairs=text_prompts,
                                                   add_special_tokens=True,
                                                   padding='longest',
                                                   return_tensors='pt')
    embeddings = model.get_projected_text_embeddings(input_ids=tokenizer_output.input_ids,
                                                     attention_mask=tokenizer_output.attention_mask)

    # Compute the pairwise similarity of the sentence embeddings obtained from the input text prompts.
    # Note: this dot product equals cosine similarity only if the projected embeddings are L2-normalised.
    sim = torch.mm(embeddings, embeddings.t())
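
The similarity step in the snippet above is a plain matrix product, which equals cosine similarity only when every embedding has unit length. The following is a minimal, dependency-free sketch of that relationship. The three-dimensional vectors are hypothetical stand-ins for real sentence embeddings, and the helper names `l2_normalise` and `cosine_similarity_matrix` are illustrative only; they are not part of the model's API.

```python
import math


def l2_normalise(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in vec]


def cosine_similarity_matrix(vectors):
    """Return pairwise cosine similarities as a list of rows."""
    unit = [l2_normalise(v) for v in vectors]
    # After normalisation, the dot product of two vectors is their cosine similarity.
    return [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]


# Hypothetical toy vectors, not real model output.
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],  # same direction as the first, different length
    [0.0, 3.0, 0.0],  # orthogonal to the first two
]
toy_sim = cosine_similarity_matrix(toy_embeddings)
```

In a matrix like `toy_sim`, entry `[i][j]` compares prompt `i` with prompt `j`, so the diagonal is always 1 and the matrix is symmetric.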

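✨ Key features

Temporal multi-modal pre-training: exploits the temporal structure between data points to improve downstream performance.
Broad downstream applicability: supports single-image and multi-image applications such as natural language inference, phrase grounding, image/text classification, and language decoding.
Improved embeddings: captures the temporal information present in the image and text modalities, as well as in the joint space, more effectively than its predecessor.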

📚 Detailed documentation

Language model variants

Model	Model identifier	Vocabulary	Notes
CXR-BERT-general	microsoft/BiomedVLP-CXR-BERT-general	PubMed & MIMIC	Pre-trained for biomedical literature and clinical domains
CXR-BERT-specialized	microsoft/BiomedVLP-CXR-BERT-specialized	PubMed & MIMIC	Static pre-training for the CXR domain
BioViL-T	microsoft/BiomedVLP-BioViL-T	PubMed & MIMIC	Static and temporal pre-training for the CXR domain

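Image model

The image model is trained jointly with the text model in a multi-modal contrastive learning framework. It is a hybrid image encoder composed of a Vision Transformer and a ResNet-50; the ResNet-50 serves as the backbone that extracts features from the image at each time point, and the transformer aggregates and compares the image features extracted across the temporal dimension. The corresponding model definition and its loading functions are available in the HI-ML-Multimodal GitHub repository. The joint image and text model, BioViL-T, can be used for phrase grounding, as demonstrated in the accompanying Python notebook example. See also the MS-CXR benchmark for a more systematic evaluation of joint image and text models on phrase grounding tasks.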
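
To make the idea of multi-modal contrastive learning concrete, here is a minimal sketch of a generic symmetric InfoNCE objective over paired image and text embeddings. This is an assumption for illustration: the text above does not spell out the exact loss used for BioViL-T, and the function names, the temperature value of 0.07, and the two-dimensional toy embeddings are all hypothetical.

```python
import math


def l2_normalise(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_on_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_info_nce(image_embs, text_embs, temperature=0.07):
    """Average of the image-to-text and text-to-image contrastive losses."""
    images = [l2_normalise(v) for v in image_embs]
    texts = [l2_normalise(v) for v in text_embs]
    # logits[i][j]: scaled cosine similarity between image i and text j.
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
              for img in images]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_on_diagonal(logits) + cross_entropy_on_diagonal(transposed))


# Hypothetical toy batch: image i is paired with text i.
toy_images = [[1.0, 0.0], [0.0, 1.0]]
toy_texts = [[0.9, 0.1], [0.2, 0.8]]
aligned_loss = symmetric_info_nce(toy_images, toy_texts)
shuffled_loss = symmetric_info_nce(toy_images, toy_texts[::-1])
```

The loss is low when each image is most similar to its own report and high when the pairing is shuffled, which is what pulls matching image and text embeddings together in the joint space.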

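Data

This model builds on existing publicly available datasets. They span a wide range of sources, from biomedical abstracts to intensive care unit notes to chest X-ray radiology notes. In the MIMIC-CXR dataset, the radiology notes are accompanied by their associated chest X-ray DICOM images.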

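Performance

The presented model achieves state-of-the-art results in radiology natural language inference by leveraging semantic and discourse characteristics more efficiently at training time. The experiments were performed on the RadNLI and MS-CXR-T benchmarks, which measure the quality of text embeddings in terms of static and temporal semantics, respectively. BioViL-T is benchmarked against other commonly used state-of-the-art domain-specific BERT models, including PubMedBERT and CXR-BERT. The results below show that BioViL-T increases the sensitivity of sentence embeddings to temporal content (MS-CXR-T) while still capturing static content (RadNLI).

Model	MS-CXR-T accuracy (%)	MS-CXR-T ROC-AUC	RadNLI (2 classes) accuracy (%)	RadNLI (2 classes) ROC-AUC
PubMedBERT	60.39	0.542	81.38	0.727
CXR-BERT-General	62.60	0.601	87.59	0.902
CXR-BERT-Specialized	78.12	0.837	89.66	0.932
BioViL-T	87.77	0.933	90.52	0.947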
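
As a rough illustration of the two metrics in the table, the sketch below computes accuracy and ROC-AUC for a binary sentence-pair task from similarity scores. It assumes each sentence pair receives one similarity score and a binary label; the scores, labels, and the 0.5 threshold are hypothetical toy values, and the exact evaluation protocol behind the reported numbers (for example, how a decision threshold is chosen) is not described here.

```python
def accuracy_at_threshold(scores, labels, threshold):
    """Fraction of pairs classified correctly when score >= threshold means label 1."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


def roc_auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical similarity scores and labels (1 = same meaning, 0 = contradiction).
toy_scores = [0.92, 0.81, 0.40, 0.35, 0.77, 0.10]
toy_labels = [1, 1, 0, 0, 0, 1]

toy_accuracy = accuracy_at_threshold(toy_scores, toy_labels, threshold=0.5)
toy_auc = roc_auc(toy_scores, toy_labels)
```

Accuracy depends on the chosen threshold, whereas ROC-AUC summarises ranking quality across all thresholds, which is why the table reports both. The table gives accuracy as a percentage and ROC-AUC as a fraction between 0 and 1.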

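The novel pre-training framework also yields better vision-language representations. Below is the zero-shot phrase grounding performance obtained on the MS-CXR benchmark dataset, which evaluates the quality of image-text latent representations.

Vision-language pre-training method	MS-CXR phrase grounding (avg. CNR score)	MS-CXR phrase grounding (mIoU)
BioViL	1.07 ± 0.04	0.229 ± 0.005
BioViL-L	1.21 ± 0.05	0.202 ± 0.010
BioViL-T	1.33 ± 0.04	0.240 ± 0.005

Additional experimental results and discussion can be found in the corresponding paper, "Learning to Exploit Temporal Structure for Biomedical Vision–Language Processing", CVPR'23.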
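
The two phrase grounding metrics can be sketched on a toy example. The code below assumes a 2D image-text similarity map and a ground-truth bounding box. It uses one common definition of contrast-to-noise ratio, the absolute difference between mean similarity inside and outside the box divided by the square root of the summed variances, and a thresholded intersection over union. The map values, the box, the 0.5 threshold, and the helper names are hypothetical, and the benchmark's exact protocol may differ.

```python
import math
from statistics import fmean, pvariance


def box_mask(height, width, box):
    """Boolean mask for box = (row0, col0, row1, col1), end-exclusive."""
    r0, c0, r1, c1 = box
    return [[r0 <= r < r1 and c0 <= c < c1 for c in range(width)] for r in range(height)]


def iou(sim_map, mask, threshold):
    """Intersection over union between the thresholded map and the box mask."""
    intersection = union = 0
    for sim_row, mask_row in zip(sim_map, mask):
        for s, inside in zip(sim_row, mask_row):
            predicted = s >= threshold
            intersection += int(predicted and inside)
            union += int(predicted or inside)
    return intersection / union if union else 0.0


def cnr(sim_map, mask):
    """Contrast-to-noise ratio between similarities inside and outside the box."""
    inside = [s for sr, mr in zip(sim_map, mask) for s, m in zip(sr, mr) if m]
    outside = [s for sr, mr in zip(sim_map, mask) for s, m in zip(sr, mr) if not m]
    noise = math.sqrt(pvariance(inside) + pvariance(outside))
    if noise == 0.0:
        raise ValueError("CNR is undefined when both regions have zero variance")
    return abs(fmean(inside) - fmean(outside)) / noise


# Hypothetical 4x4 similarity map with a bright 2x2 region.
toy_map = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.8, 0.1],
    [0.1, 0.7, 0.9, 0.6],
    [0.0, 0.1, 0.2, 0.1],
]
toy_mask = box_mask(4, 4, (1, 1, 3, 3))
toy_iou = iou(toy_map, toy_mask, threshold=0.5)
toy_cnr = cnr(toy_map, toy_mask)
```

CNR needs no threshold and rewards maps whose activation is concentrated inside the box, while IoU depends on the threshold used to binarise the map; mIoU in the table is the mean IoU over examples.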

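Limitations

Language: the model was developed using English corpora and should therefore be considered English-only.
Data: the training dataset contains only medical images and reports acquired in an intensive care unit (ICU), where longitudinal images are typically collected within hours or at most a few days. As a result, model performance may degrade when analyzing consecutive images acquired over longer periods (for example, years), because significant anatomical variation is observed between such scans.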

📄 License

This project is released under the MIT license.

🔗 Citation

The corresponding paper has been accepted for presentation at the Conference on Computer Vision and Pattern Recognition (CVPR) 2023.

@misc{https://doi.org/10.48550/arXiv.2301.04558,
  doi = {10.48550/ARXIV.2301.04558},
  url = {https://arxiv.org/abs/2301.04558},
  author = {Bannur, Shruthi and Hyland, Stephanie and Liu, Qianchu and Perez-Garcia, Fernando and Ilse, Maximilian and Castro, Daniel C and Boecking, Benedikt and Sharma, Harshita and Bouzid, Kenza and Thieme, Anja and Schwaighofer, Anton and Wetscherek, Maria and Lungren, Matthew P and Nori, Aditya and Alvarez-Valle, Javier and Oktay, Ozan},
  title = {Learning to Exploit Temporal Structure for Biomedical Vision–Language Processing},
  publisher = {arXiv},
  year = {2023},
}

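⚠️ Important notice

This model is intended solely for (I) future research on vision-language processing and (II) reproducing the experimental results reported in the reference paper.
Any deployed use case of the model, commercial or otherwise, is currently out of scope. Although the model was evaluated on a broad set of publicly available research benchmarks, the model and the evaluations are not intended for deployed use cases. Under unprecedented conditions, the model may make inaccurate predictions and display limitations, which may require additional mitigation strategies. We therefore discourage use of the model for automated diagnosis or in a medical device. Please refer to the associated paper for more details.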