M3D-CLIP: Open-Source 3D Medical Imaging Model - Vision-Language Alignment for Assisted Diagnosis

M3D CLIP

Developed by GoodBaiBai88

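M3D-CLIP is a CLIP model built specifically for 3D medical imaging. It aligns vision and language through a contrastive loss.

Multimodal Alignment

Transformers

Open Source License: Apache-2.0 #3D Medical CLIP #Cross-Modal Retrieval #Medical Image Analysis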

Downloads 2,962

Release Time : 4/25/2024

Model Overview

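M3D-CLIP is a vision-language model based on a 3D ViT architecture. It is intended for cross-modal retrieval between 3D medical images and text, and for extracting aligned features.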

Model Features

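Built for 3D Medical Imaging

Designed specifically for 3D medical images; uses a 3D ViT architecture to process 3D images of size 32*256*256.

Cross-Modal Alignment

Aligns 3D medical images and text semantically through a contrastive loss.

Strong Representations

Provides aligned, strong image and text features for downstream tasks.

Pretraining Advantage

The text-aligned vision encoder can serve as a good pretrained model for vision and multimodal tasks.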

Model Capabilities

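3D medical image feature extraction

Medical image-text cross-modal retrieval

Semantic understanding of medical images

Multimodal representation learning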

Use Cases

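Medical Image Analysis

Medical Image Retrieval

Retrieve relevant 3D medical images from a text description.

Efficient and accurate cross-modal retrieval.

Medical Report Generation

Generate descriptive text for 3D medical images.

Medical Image Classification

Classify images using the aligned features.

Medical Research

Medical Knowledge Mining

Discover associations in large-scale medical image and text data.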

🚀 M3D-CLIP

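M3D-CLIP is a 3D medical CLIP model and part of the M3D series of work. It aligns vision and language with a contrastive loss on the M3D-Cap dataset. The model supports 3D medical image-text retrieval, and the image and text features it extracts can also be used in downstream tasks.

🚀 Quick Start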

import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel

device = torch.device("cuda")  # or "cpu"

tokenizer = AutoTokenizer.from_pretrained(
    "GoodBaiBai88/M3D-CLIP",
    model_max_length=512,
    padding_side="right",
    use_fast=False
)
model = AutoModel.from_pretrained(
    "GoodBaiBai88/M3D-CLIP",
    trust_remote_code=True
)
model = model.to(device=device)

# Prepare your 3D medical image:
# 1. Resize or resample the image to shape 1*32*256*256.
# 2. Normalize intensities to the range 0-1, for example with min-max normalization.
# 3. Save the image as a .npy file.
# 4. Although the model was not trained on 2D images, a 2D image can in theory be interpolated to shape 1*32*256*256 and used as input.
    
image_path = ""
input_txt = ""

text_tensor = tokenizer(input_txt, max_length=512, truncation=True, padding="max_length", return_tensors="pt")
input_id = text_tensor["input_ids"].to(device=device)
attention_mask = text_tensor["attention_mask"].to(device=device)
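# np.load returns a NumPy array, which has no .to(); convert it to a tensor first.
# Assumption: the file holds one 1*32*256*256 volume and the encoder expects a leading batch dimension.
image = torch.from_numpy(np.load(image_path)).float().unsqueeze(0).to(device=device)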

with torch.inference_mode():
    image_features = model.encode_image(image)[:, 0]
    text_features = model.encode_text(input_id, attention_mask)[:, 0]
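
The image preparation steps listed in the comments above can be sketched with NumPy alone. This is a minimal illustration, not the authors' preprocessing pipeline: the helper names `resize_nearest` and `preprocess_volume` are hypothetical, nearest-neighbour resampling is just one simple choice of resize method, and the random input stands in for a real scan.

```python
import numpy as np

def resize_nearest(volume: np.ndarray, out_shape=(32, 256, 256)) -> np.ndarray:
    """Nearest-neighbour resize of a (D, H, W) volume to out_shape."""
    # For each axis, compute which source index every output position maps to.
    idx = [
        np.minimum((np.arange(n_out) * n_in) // n_out, n_in - 1)
        for n_in, n_out in zip(volume.shape, out_shape)
    ]
    return volume[np.ix_(*idx)]

def preprocess_volume(volume: np.ndarray) -> np.ndarray:
    """Return a float32 array of shape (1, 32, 256, 256) scaled to [0, 1]."""
    volume = resize_nearest(volume.astype(np.float32))
    lo, hi = volume.min(), volume.max()
    # Min-max normalization; a constant volume maps to all zeros.
    volume = (volume - lo) / (hi - lo) if hi > lo else np.zeros_like(volume)
    return volume[np.newaxis]  # add the leading channel axis

# Hypothetical input: random values standing in for a (D, H, W) scan.
raw = np.random.default_rng(0).normal(size=(40, 300, 280)) * 500
prepared = preprocess_volume(raw)
print(prepared.shape, prepared.dtype)
```

The result can then be written with `np.save` to the `.npy` path that the quick-start code loads. For real data, an interpolating resize (for example trilinear) may preserve detail better than nearest-neighbour sampling.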

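✨ Key Features

3D medical image-text retrieval: the model can be applied to retrieval between 3D medical images and text.
Strong image and text features: it provides aligned, strong image and text features for downstream tasks.
Good pretrained model: the text-aligned vision encoder is a good pretrained model for vision and multimodal tasks.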
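
As an illustration of the retrieval use case, the sketch below ranks a set of image features against one text feature. It assumes cosine similarity as the scoring function, which is a common choice for CLIP-style features but is not stated in the source; the function `rank_by_cosine` is hypothetical, and the 4-dimensional vectors are toy stand-ins for real encoder outputs (tensors from the model could be converted with `.cpu().numpy()`).

```python
import numpy as np

def rank_by_cosine(text_feature: np.ndarray, image_features: np.ndarray) -> np.ndarray:
    """Return image indices sorted from most to least similar to the text."""
    text = text_feature / np.linalg.norm(text_feature)
    images = image_features / np.linalg.norm(image_features, axis=1, keepdims=True)
    scores = images @ text  # one cosine similarity per image
    return np.argsort(-scores)

# Toy features standing in for real encoder outputs.
text_feature = np.array([1.0, 0.0, 1.0, 0.0])
image_features = np.array([
    [0.0, 1.0, 0.0, 1.0],  # orthogonal to the text feature
    [2.0, 0.0, 2.0, 0.0],  # same direction as the text feature
    [1.0, 1.0, 0.0, 0.0],  # partially aligned
])
ranking = rank_by_cosine(text_feature, image_features)
print(ranking)  # [1 2 0]
```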

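🔧 Technical Details

M3D-CLIP's vision encoder is a 3D ViT that processes images of size 32*256*256 with a patch size of 4*16*16. The language encoder is initialized from a pretrained BERT. The model aligns vision and language through a contrastive loss on the M3D-Cap dataset.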
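
The source only says that a contrastive loss is used. For readers unfamiliar with the idea, here is a minimal NumPy sketch of the symmetric cross-entropy loss popularized by CLIP. Whether M3D-CLIP uses exactly this formulation is an assumption, as are the function name `clip_style_loss` and the temperature value 0.07.

```python
import numpy as np

def log_softmax(x: np.ndarray, axis: int) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_style_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric cross-entropy over an image-text similarity matrix."""
    img = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    txt = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # shape (batch, batch)
    diag = np.arange(len(logits))       # matching pairs sit on the diagonal
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()  # text -> image
    return (loss_i2t + loss_t2i) / 2

# Orthogonal toy features: four pairs in an 8-dimensional space.
feats = np.eye(4, 8)
aligned = clip_style_loss(feats, feats)         # every image matches its text
shuffled = clip_style_loss(feats, feats[::-1])  # every pair is mismatched
print(aligned < shuffled)  # True
```

The loss is near zero when each image is most similar to its own text and grows when pairs are mismatched, which is what pushes the two encoders toward a shared embedding space.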

📄 License

This project is released under the Apache-2.0 license.

📚 Documentation

Citation

If you find this work helpful, please consider citing:

@misc{bai2024m3d,
      title={M3D: Advancing 3D Medical Image Analysis with Multi-Modal Large Language Models}, 
      author={Fan Bai and Yuxin Du and Tiejun Huang and Max Q. -H. Meng and Bo Zhao},
      year={2024},
      eprint={2404.00578},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}