M3D-CLIP: An Open-Source 3D Medical Imaging Model for Vision-Language Alignment in Assisted Diagnosis

M3D CLIP

Developed by GoodBaiBai88

M3D-CLIP is a CLIP model designed for 3D medical images. It aligns vision and language through a contrastive loss.

Multimodal alignment

Transformers

License: Apache-2.0 #3D medical CLIP #cross-modal retrieval #medical image analysis

Downloads: 2,962

Released: 4/25/2024

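Model Overview

M3D-CLIP is a vision-language model built on a 3D ViT architecture. It is intended for cross-modal retrieval between 3D medical images and text, and for extracting aligned features from both.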

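Model Features

Built for 3D medical images

Designed specifically for 3D medical images; a 3D ViT processes volumes of size 32*256*256

Cross-modal alignment

A contrastive loss aligns 3D medical images and text semantically

Strong representations

Provides aligned, strong image and text features for downstream tasks

Pretraining value

The text-aligned vision encoder can serve as a pretrained model for vision and multimodal tasks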

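Model Capabilities

3D medical image feature extraction

Medical image-text cross-modal retrieval

Semantic understanding of medical images

Multimodal representation learning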

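Use Cases

Medical image analysis

Medical image retrieval

Retrieve relevant 3D medical images from a text description

Efficient, accurate cross-modal retrieval

Medical report generation

Supply aligned image features to a downstream model that generates descriptive text for 3D medical images

Medical image classification

Classify images using the aligned features

Medical research

Medical knowledge mining

Discover associations in large-scale medical image and text data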

🚀 M3D-CLIP

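M3D-CLIP is a 3D medical CLIP model from the M3D series of work. It aligns vision and language with a contrastive loss on the M3D-Cap dataset. It supports retrieval between 3D medical images and text, and the image and text features it extracts can also be used in downstream tasks.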
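The page does not spell out the exact form of the contrastive loss. As a rough illustration only, the sketch below implements the symmetric cross-entropy loss commonly used for CLIP-style training, in plain NumPy. The function names, the temperature of 0.07, and the random stand-in features are all assumptions made for this example and are not taken from the M3D code.

```python
import numpy as np


def log_softmax(logits: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def clip_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric cross-entropy over the image-text similarity matrix.

    Row i of image_feats and row i of text_feats are assumed to be a matched pair.
    The temperature value is an assumption, not a documented M3D-CLIP setting.
    """
    img = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    txt = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # shape (batch, batch)
    diag = np.arange(logits.shape[0])
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)


# Stand-in features: identical rows model perfectly aligned pairs,
# reversed rows model mismatched pairs.
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8))
aligned_loss = clip_contrastive_loss(feats, feats)
shuffled_loss = clip_contrastive_loss(feats, feats[::-1])
```

Matched pairs sit on the diagonal of the similarity matrix, so the loss falls as each image becomes more similar to its own text than to the other texts in the batch, which is why `aligned_loss` comes out lower than `shuffled_loss` here.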

🚀 Quick Start

import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

device = torch.device("cuda") # or cpu

tokenizer = AutoTokenizer.from_pretrained(
    "GoodBaiBai88/M3D-CLIP",
    model_max_length=512,
    padding_side="right",
    use_fast=False
)
model = AutoModel.from_pretrained(
    "GoodBaiBai88/M3D-CLIP",
    trust_remote_code=True
)
model = model.to(device=device)

# Prepare your 3D medical image:
# 1. The image shape needs to be processed as 1*32*256*256, considering resize and other methods.
# 2. The image needs to be normalized to 0-1, considering Min-Max Normalization.
# 3. The image format needs to be converted to .npy 
# 4. Although we did not train on 2D images, in theory, the 2D image can be interpolated to the shape of 1*32*256*256 for input.
    
image_path = ""
input_txt = ""

text_tensor = tokenizer(input_txt, max_length=512, truncation=True, padding="max_length", return_tensors="pt")
input_id = text_tensor["input_ids"].to(device=device)
attention_mask = text_tensor["attention_mask"].to(device=device)
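# np.load returns a NumPy array, which has no .to(); convert it to a tensor first.
# Assumption: the .npy file stores a 1*32*256*256 array, so a batch dimension is added in front.
image = torch.from_numpy(np.load(image_path)).float().unsqueeze(0).to(device=device)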

with torch.inference_mode():
    image_features = model.encode_image(image)[:, 0]
    text_features = model.encode_text(input_id, attention_mask)[:, 0]
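The preparation steps listed in the comments above (resize to 1*32*256*256, Min-Max normalization, save as .npy) could be sketched as follows. This is a minimal NumPy-only illustration: the helper names are hypothetical, and nearest-neighbour resampling is used purely to keep the example dependency-free. The source does not say which resize method the authors used, and a real pipeline would more likely use trilinear interpolation.

```python
import numpy as np


def min_max_normalize(volume: np.ndarray) -> np.ndarray:
    """Rescale intensities to [0, 1]; a constant volume maps to all zeros."""
    volume = volume.astype(np.float32)
    lo, hi = volume.min(), volume.max()
    if hi == lo:
        return np.zeros_like(volume)
    return (volume - lo) / (hi - lo)


def resize_nearest(volume: np.ndarray, target=(32, 256, 256)) -> np.ndarray:
    """Nearest-neighbour resample of a (D, H, W) volume to the target shape."""
    index = [
        np.minimum((np.arange(t) * s / t).astype(int), s - 1)
        for s, t in zip(volume.shape, target)
    ]
    return volume[np.ix_(*index)]


def prepare_volume(volume: np.ndarray) -> np.ndarray:
    """Return a float32 array of shape (1, 32, 256, 256) with values in [0, 1]."""
    resized = resize_nearest(min_max_normalize(volume))
    return resized[np.newaxis, ...]  # add the leading channel dimension


# Hypothetical CT-like volume: 40 slices of 128 x 128 with arbitrary intensities.
raw = np.random.default_rng(0).integers(-1000, 1000, size=(40, 128, 128))
prepared = prepare_volume(raw)
# np.save("example_volume.npy", prepared)  # hypothetical output path
```

The saved file can then be used as `image_path` in the quick-start code.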

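✨ Key Features

3D medical image-text retrieval: well suited to retrieval between 3D medical images and text.
Strong image and text features: provides aligned, strong image and text features for downstream tasks.
Useful pretrained model: the text-aligned vision encoder is a good pretrained model for vision and multimodal tasks.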
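For the retrieval use described above, one common approach is to rank candidates by cosine similarity between a text feature and a set of image features. The sketch below assumes that approach; the page does not state how retrieval scores are computed. The function name and the three-dimensional stand-in vectors are hypothetical, and in practice the vectors would come from `encode_text` and `encode_image`.

```python
import numpy as np


def rank_by_cosine(query: np.ndarray, candidates: np.ndarray) -> np.ndarray:
    """Return candidate indices sorted from most to least similar to the query."""
    q = query / np.linalg.norm(query)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    scores = c @ q  # cosine similarity of each candidate with the query
    return np.argsort(-scores)


# Stand-in vectors, used only to show the ranking logic.
text_feature = np.array([1.0, 0.0, 0.0])
image_features = np.array([
    [0.0, 1.0, 0.0],   # orthogonal to the query
    [0.9, 0.1, 0.0],   # close to the query
    [-1.0, 0.0, 0.0],  # opposite direction
])
ranking = rank_by_cosine(text_feature, image_features)
```

Here `ranking` lists the second image first, then the orthogonal one, then the opposite one.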

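🔧 Technical Details

The vision encoder of M3D-CLIP is a 3D ViT that processes images of size 32*256*256 with a patch size of 4*16*16. The language encoder is initialized from a pretrained BERT. The model aligns vision and language through a contrastive loss on the M3D-Cap dataset.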
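As a quick check on these sizes, the arithmetic below works out how many patches a single volume yields. It assumes non-overlapping patches, as in a standard ViT. Whether extra tokens (such as a class token) are prepended is not stated in the source, so they are not counted.

```python
image_size = (32, 256, 256)  # depth, height, width
patch_size = (4, 16, 16)

# Number of patches along each axis, assuming non-overlapping patches.
grid = tuple(dim // p for dim, p in zip(image_size, patch_size))
num_patches = grid[0] * grid[1] * grid[2]
voxels_per_patch = patch_size[0] * patch_size[1] * patch_size[2]

print(grid, num_patches, voxels_per_patch)  # (8, 16, 16) 2048 1024
```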

📄 License

This project is released under the Apache-2.0 license.

📚 Documentation

Citation

If you find this work helpful, please consider citing the following paper:

@misc{bai2024m3d,
      title={M3D: Advancing 3D Medical Image Analysis with Multi-Modal Large Language Models}, 
      author={Fan Bai and Yuxin Du and Tiejun Huang and Max Q.-H. Meng and Bo Zhao},
      year={2024},
      eprint={2404.00578},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}