Open-source BiomedVLP-BioViL-T model: free analysis of chest X-rays and radiology reports


BiomedVLP-BioViL-T

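Developed by Microsoft

BioViL-T is a vision-language model for analysing chest X-rays and radiology reports; it improves performance through temporal multimodal pretraining.

Multimodal fusion

Transformers

Language: English | License: MIT | Tags: #ChestXRayAnalysis #TemporalMultimodalPretraining #RadiologyReportGeneration

Downloads: 26.39k

Released: February 17, 2023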

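Model overview

BioViL-T is a domain-specific vision-language model for analysing chest X-rays (CXRs) and radiology reports. It is trained with a temporal multimodal pretraining procedure that embeds temporal information in the image modality, the text modality and the joint space, which improves performance on several downstream tasks (see the performance tables below).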

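Model features

Temporal multimodal pretraining

Exploits the temporal structure between data points to improve downstream performance while using the same training dataset as its predecessor.

Cross-modal alignment

Aligns text and image embeddings through the latent representation of the [CLS] token, which supports cross-modal understanding.

Domain-specific optimisation

Optimised specifically for chest X-rays and radiology reports, with strong results on tasks in that domain.

Two-stage training

The language model is first pretrained on the general biomedical domain and then trained on the radiology domain specifically.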

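Model capabilities

Chest X-ray analysis

Radiology report understanding

Natural language inference

Phrase grounding

Image classification

Text classification

Language decoding

Cross-modal retrieval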

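Use cases

Medical image analysis

Chest X-ray abnormality detection

Analyse chest X-rays and detect abnormalities such as pleural effusion or pneumothorax.

The model reaches 87.77% accuracy on the MS-CXR-T benchmark. Note that this figure, reported in the performance table below, measures how well the sentence embeddings capture temporal content; it is not an image-based detection accuracy.

Radiology report generation

Generate or supplement radiology reports from chest X-rays.

Medical research

Research on medical imaging and language processing

Supports AI researchers exploring clinical NLP and VLP research questions.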

🚀 BioViL-T

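BioViL-T is a domain-specific vision-language model designed to analyse chest X-rays (CXRs) and radiology reports. It is trained with a temporal multimodal pretraining procedure, which distinguishes it from its predecessor model (BioViL). Specifically, BioViL-T exploits the temporal structure between data points, which improves downstream performance on multiple benchmarks while using the same training dataset as its predecessor. In particular, the model shows a significant improvement in embedding the temporal information present in the image and text modalities (see the results), as well as in the joint space. The canonical model can be adapted to both single-image and multi-image downstream applications, including natural language inference, phrase grounding, image/text classification and language decoding.

🚀 Quick Start

Model usage example

The following shows how to use this model to extract radiology sentence embeddings and obtain their cosine similarity in the joint (image and text) space: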

import torch
from transformers import AutoModel, AutoTokenizer

# Load the model and tokenizer
url = "microsoft/BiomedVLP-BioViL-T"
tokenizer = AutoTokenizer.from_pretrained(url, trust_remote_code=True)
model = AutoModel.from_pretrained(url, trust_remote_code=True)

# Input text prompts describing findings.
# The order of prompts is adjusted to capture the spectrum from absence of a finding to its temporal progression.
text_prompts = ["No pleural effusion or pneumothorax is seen.",
                "There is no pneumothorax or pleural effusion.",
                "The extent of the pleural effusion is reduced.",
                "The extent of the pleural effusion remains constant.",
                "Interval enlargement of pleural effusion."]

# Tokenize and compute the sentence embeddings
with torch.no_grad():
    tokenizer_output = tokenizer.batch_encode_plus(batch_text_or_text_pairs=text_prompts,
                                                   add_special_tokens=True,
                                                   padding='longest',
                                                   return_tensors='pt')
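    embeddings = model.get_projected_text_embeddings(input_ids=tokenizer_output.input_ids,
                                                     attention_mask=tokenizer_output.attention_mask)

    # Re-normalize to unit length so that the dot product below is a cosine similarity
    # (this leaves the embeddings unchanged if the model already returns unit-length vectors).
    embeddings = torch.nn.functional.normalize(embeddings, dim=1)

    # Pairwise cosine similarity between all input text prompts: a 5 x 5 matrix.
    sim = torch.mm(embeddings, embeddings.t())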
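The matrix product above equals a cosine-similarity matrix only when every embedding has unit length. The following is a minimal pure-Python sketch of that relationship; the helper names and the toy 3-dimensional vectors are illustrative assumptions, not outputs of BioViL-T.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity_matrix(embeddings):
    """Pairwise dot products of L2-normalized rows, i.e. cosine similarities."""
    unit = [l2_normalize(row) for row in embeddings]
    return [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]


# Toy vectors standing in for projected sentence embeddings.
# These numbers are made up for illustration; they are not model outputs.
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],  # same direction as the first row, different length
    [0.0, 3.0, 0.0],  # orthogonal to the first two rows
]

sim = similarity_matrix(toy_embeddings)
for row in sim:
    print([round(value, 3) for value in row])
```

Rows that point in the same direction score 1 regardless of their length, and orthogonal rows score 0. With real embeddings, the two "no effusion" prompts would be expected to score higher with each other than with the prompts describing progression.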

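✨ Key Features

Temporal multimodal pretraining: exploits the temporal structure between data points to improve downstream performance.
Broad downstream applicability: suits single-image and multi-image applications such as natural language inference, phrase grounding, image/text classification and language decoding.
Improved embeddings: better captures the temporal information present in the image and text modalities, as well as in the joint space.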

📚 Documentation

Language model variants

Model	Model identifier	Vocabulary	Note
CXR-BERT-general	microsoft/BiomedVLP-CXR-BERT-general	PubMed & MIMIC	Pretrained for the biomedical literature and clinical domains
CXR-BERT-specialized	microsoft/BiomedVLP-CXR-BERT-specialized	PubMed & MIMIC	Static pretraining for the CXR domain
BioViL-T	microsoft/BiomedVLP-BioViL-T	PubMed & MIMIC	Static and temporal pretraining for the CXR domain

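Image model

The image model is trained jointly with the text model in a multimodal contrastive learning framework. It is a hybrid image encoder composed of a Vision Transformer and a ResNet-50; the ResNet-50 serves as the backbone that extracts features from the image at each time point, and the transformer is included to aggregate and compare the image features extracted across the temporal dimension. The corresponding model definition and its loading functions are available in the HI-ML-Multimodal GitHub repository. The joint image and text model, BioViL-T, can be used for phrase grounding applications, as shown in the accompanying Python notebook example. See also the MS-CXR benchmark for a more systematic evaluation of joint image and text models on phrase grounding tasks.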
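This page does not state the exact contrastive objective. As a rough illustration of what training image and text encoders jointly in a contrastive framework involves, the sketch below computes a standard symmetric InfoNCE loss from a matrix of image-text cosine similarities. The loss choice, the function names, the temperature value and the toy similarity matrices are all assumptions made for illustration, not details of BioViL-T.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    peak = max(row)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_total for x in row]


def symmetric_info_nce(similarity, temperature=0.5):
    """Average of image-to-text and text-to-image cross-entropy.

    similarity[i][j] is the cosine similarity between image i and text j;
    matched pairs are assumed to sit on the diagonal.
    """
    n = len(similarity)
    logits = [[s / temperature for s in row] for row in similarity]
    transposed = [list(col) for col in zip(*logits)]
    image_to_text = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    text_to_image = -sum(log_softmax(transposed[i])[i] for i in range(n)) / n
    return 0.5 * (image_to_text + text_to_image)


# Made-up similarity matrices for a batch of three image-report pairs.
aligned = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.0, 0.1, 0.9]]
shuffled = [[0.1, 0.9, 0.0], [0.8, 0.2, 0.1], [0.0, 0.1, 0.9]]

print(symmetric_info_nce(aligned))   # lower: matched pairs are most similar
print(symmetric_info_nce(shuffled))  # higher: two pairs are mismatched
```

The loss falls as each image becomes more similar to its own report than to the other reports in the batch, which is what pulls the two modalities into a shared space.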

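Data

The model builds upon existing publicly available datasets drawn from PubMed and MIMIC.

These datasets span a wide range of sources, from biomedical abstracts to intensive care unit notes to chest X-ray radiology notes. In the MIMIC-CXR dataset, the radiology notes are accompanied by their associated chest X-ray DICOM images.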

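Performance

The proposed model achieves state-of-the-art results in radiology natural language inference by making more effective use of semantic and discourse characteristics at training time. The experiments were performed on the RadNLI and MS-CXR-T benchmarks, which measure the quality of text embeddings in terms of static and temporal semantics, respectively. BioViL-T is benchmarked against other commonly used state-of-the-art domain-specific BERT models, including PubMedBERT and CXR-BERT. The results below show that BioViL-T increases the sensitivity of sentence embeddings to temporal content (MS-CXR-T) while still capturing static content (RadNLI).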

Model	MS-CXR-T accuracy (%)	MS-CXR-T ROC-AUC	RadNLI (2 classes) accuracy (%)	RadNLI (2 classes) ROC-AUC
PubMedBERT	60.39	0.542	81.38	0.727
CXR-BERT-General	62.60	0.601	87.59	0.902
CXR-BERT-Specialized	78.12	0.837	89.66	0.932
BioViL-T	87.77	0.933	90.52	0.947
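To make the two metric columns concrete, here is a minimal sketch of how accuracy and ROC-AUC can be computed when sentence pairs are scored by embedding similarity. It assumes each pair is labelled 1 (paraphrase) or 0 (contradiction) and scored by cosine similarity; the exact evaluation protocol is described in the paper, and the scores, labels and threshold below are made up for illustration.

```python
def accuracy_at_threshold(scores, labels, threshold):
    """Fraction of pairs classified correctly when score >= threshold means 'paraphrase'."""
    correct = sum((score >= threshold) == bool(label)
                  for score, label in zip(scores, labels))
    return correct / len(labels)


def roc_auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((1.0 if p > n else (0.5 if p == n else 0.0))
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Made-up cosine similarities for six sentence pairs; 1 = paraphrase, 0 = contradiction.
toy_scores = [0.92, 0.85, 0.40, 0.77, 0.30, 0.81]
toy_labels = [1, 1, 0, 1, 0, 0]

print(accuracy_at_threshold(toy_scores, toy_labels, threshold=0.8))
print(roc_auc(toy_scores, toy_labels))
```

Accuracy depends on the chosen threshold, whereas ROC-AUC summarises ranking quality over all thresholds, which is why the table reports both.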

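The novel pretraining framework also yields better vision-language representations. Below is the zero-shot phrase grounding performance obtained on the MS-CXR benchmark dataset, which evaluates the quality of image-text latent representations.

Vision-language pretraining method	MS-CXR phrase grounding (average CNR score)	MS-CXR phrase grounding (mIoU)
BioViL	1.07 ± 0.04	0.229 ± 0.005
BioViL-L	1.21 ± 0.05	0.202 ± 0.010
BioViL-T	1.33 ± 0.04	0.240 ± 0.005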
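The two grounding metrics compare a text-conditioned similarity map over the image with an annotated bounding box. The sketch below assumes a contrast-to-noise ratio (CNR) defined as the difference between the mean similarity inside and outside the box, divided by the square root of the summed variances (some formulations take the absolute value of the numerator), and an IoU between the thresholded map and the box. These definitions, the function names and the toy 4 x 4 map are assumptions for illustration; the paper specifies the exact protocol.

```python
import math


def mean(values):
    return sum(values) / len(values)


def variance(values):
    mu = mean(values)
    return sum((v - mu) ** 2 for v in values) / len(values)


def split_by_mask(similarity_map, box_mask):
    """Return the similarity values inside and outside the bounding-box mask."""
    inside, outside = [], []
    for row_s, row_m in zip(similarity_map, box_mask):
        for s, m in zip(row_s, row_m):
            (inside if m else outside).append(s)
    return inside, outside


def contrast_to_noise_ratio(similarity_map, box_mask):
    """(mean inside - mean outside) / sqrt(var inside + var outside)."""
    inside, outside = split_by_mask(similarity_map, box_mask)
    return (mean(inside) - mean(outside)) / math.sqrt(variance(inside) + variance(outside))


def intersection_over_union(similarity_map, box_mask, threshold):
    """IoU between the thresholded similarity map and the box mask."""
    intersection = union = 0
    for row_s, row_m in zip(similarity_map, box_mask):
        for s, m in zip(row_s, row_m):
            predicted = s >= threshold
            intersection += int(predicted and bool(m))
            union += int(predicted or bool(m))
    return intersection / union if union else 0.0


# Made-up similarity map and box mask (1 = pixel inside the annotated box).
similarity_map = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.8, 0.9, 0.1],
    [0.1, 0.7, 0.6, 0.5],
    [0.0, 0.1, 0.2, 0.1],
]
box_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]

print(contrast_to_noise_ratio(similarity_map, box_mask))
print(intersection_over_union(similarity_map, box_mask, threshold=0.5))
```

Averaging the IoU over all image-phrase pairs (and, in some protocols, over several thresholds) gives an mIoU. CNR needs no threshold, so it rewards maps that are high inside the box and low outside it.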

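Further experimental results and discussion can be found in the corresponding paper, "Learning to Exploit Temporal Structure for Biomedical Vision–Language Processing", CVPR'23.

Limitations

Language: the model was developed using English corpora and should therefore be considered English-only.
Data: the training dataset contains only medical images and reports acquired in an intensive care unit (ICU), where longitudinal images are typically collected within hours or at most a few days of each other. As a result, performance may degrade when the model analyses consecutive images acquired over longer periods (for example, years), where significant anatomical changes are observed between scans.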

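📄 License

This project is released under the MIT License.

🔗 Citation

The corresponding manuscript has been accepted for presentation at the Conference on Computer Vision and Pattern Recognition (CVPR) 2023.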

@misc{https://doi.org/10.48550/arXiv.2301.04558,
  doi = {10.48550/ARXIV.2301.04558},
  url = {https://arxiv.org/abs/2301.04558},
  author = {Bannur, Shruthi and Hyland, Stephanie and Liu, Qianchu and Perez-Garcia, Fernando and Ilse, Maximilian and Castro, Daniel C and Boecking, Benedikt and Sharma, Harshita and Bouzid, Kenza and Thieme, Anja and Schwaighofer, Anton and Wetscherek, Maria and Lungren, Matthew P and Nori, Aditya and Alvarez-Valle, Javier and Oktay, Ozan},
  title = {Learning to Exploit Temporal Structure for Biomedical Vision–Language Processing},
  publisher = {arXiv},
  year = {2023},
}

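⚠️ Important Note

This model is intended solely for (I) future research on vision-language processing and (II) reproducing the experimental results reported in the reference paper.
Any deployed use case of the model, commercial or otherwise, is currently out of scope. Although the model was evaluated on a broad set of publicly available research benchmarks, neither the model nor the evaluations were intended for deployed use cases. Under unprecedented conditions, the model may make inaccurate predictions and display limitations, which may require additional mitigation strategies. We therefore discourage use of the model for automated diagnosis or in a medical device. Please refer to the associated paper for more details.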