M3D-CLIP Open-source 3D Medical Imaging Model - Achieving Visual and Linguistic Alignment for Auxiliary Diagnosis

M3D CLIP

Developed by GoodBaiBai88

M3D-CLIP is a CLIP model specifically designed for 3D medical imaging, achieving visual and language alignment through contrastive loss.

Multimodal Alignment

Transformers

Open Source License: Apache-2.0 #3D Medical CLIP #Cross-modal Retrieval #Medical Image Analysis

Downloads 2,962

Release Time: 4/25/2024

Model Overview

M3D-CLIP is a vision-language model based on the 3D ViT architecture, specifically designed for cross-modal retrieval and aligned feature extraction between 3D medical images and text.

Model Features

Specialized for 3D Medical Imaging

Designed specifically for 3D medical imaging, using a 3D ViT architecture to process 3D images of size 32*256*256.

Cross-modal Alignment

Achieves semantic alignment between 3D medical images and text through contrastive loss.
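The exact form of the contrastive loss is not given here. As a minimal sketch, assuming the standard symmetric CLIP-style (InfoNCE) objective with a fixed temperature, the idea can be written in plain Python. The function names, the temperature of 0.07, and the toy vectors are illustrative assumptions, not taken from the M3D code.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def clip_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired image/text features.

    Pair i (image_feats[i], text_feats[i]) is the positive; every other
    combination in the batch acts as a negative.
    """
    imgs = [l2_normalize(v) for v in image_feats]
    txts = [l2_normalize(v) for v in text_feats]
    # logits[i][j] = scaled cosine similarity between image i and text j
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))

# Toy batch: correctly paired features give a low loss, swapped pairs a high one.
matched = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(matched < swapped)
```

Minimizing this loss pulls each image embedding toward its paired text embedding and pushes it away from the other texts in the batch, which is what "alignment" refers to.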

Strong Representation Features

Provides strong, aligned image and text representations for downstream tasks.

Pre-training Advantage

The text-aligned visual encoder can serve as a high-quality pre-trained model for vision/multimodal tasks.

Model Capabilities

3D medical image feature extraction

Cross-modal retrieval for medical text and images

Semantic understanding of medical images

Multimodal representation learning

Use Cases

Medical Image Analysis

Medical Image Retrieval

Retrieve relevant 3D medical images based on text descriptions.

Because images and text share one aligned feature space, a text query can be matched against image features directly.
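A minimal sketch of text-to-image retrieval over precomputed features, assuming cosine similarity as the ranking score (a common choice for CLIP-style models, not something this page specifies). The helper names and the three-dimensional toy vectors are hypothetical. In practice the vectors would be the image and text features returned by the model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_images(text_feature, image_features, top_k=3):
    """Return (image_index, score) pairs for the top_k most similar images."""
    scored = [(cosine_similarity(text_feature, feat), idx)
              for idx, feat in enumerate(image_features)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:top_k]]

# Toy gallery of three image features and one text query feature.
gallery = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
query = [1.0, 0.0, 0.0]
ranking = rank_images(query, gallery, top_k=2)
print([idx for idx, _ in ranking])  # image indices, best match first
```

For a large gallery, the image features would typically be computed once and stored, so that each query only requires encoding the text and scoring it against the stored vectors.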

Medical Report Generation

Generate descriptive text for 3D medical images.

Medical Image Classification

Perform image classification using aligned features.

Medical Research

Medical Knowledge Mining

Discover associative knowledge from large-scale medical image and text data.

🚀 M3D-CLIP

M3D-CLIP is a 3D medical CLIP model that aligns vision and language, offering powerful features for 3D medical image and text retrieval tasks.

✨ Features

M3D-CLIP is part of the M3D series.
It aligns vision and language via contrastive loss on the M3D-Cap dataset.
The vision encoder uses a 3D ViT with an image size of 32*256*256 and a patch size of 4*16*16.
The language encoder is initialized from a pre-trained BERT.
It can be used for 3D medical image and text retrieval tasks.
It provides aligned and powerful image and text features for downstream tasks.
Text-aligned visual encoders serve as excellent pre-trained models for visual and multi-modal tasks.
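To make the encoder geometry concrete, the arithmetic below works out how many patches a 32*256*256 volume yields with a 4*16*16 patch size. This is only the patch count implied by those two numbers; whether the encoder adds extra tokens (for example a leading summary token) is not stated here, and the helper name is hypothetical.

```python
import math

image_size = (32, 256, 256)  # depth, height, width
patch_size = (4, 16, 16)

def patch_grid(image_size, patch_size):
    """Number of non-overlapping patches along each axis."""
    for dim, patch in zip(image_size, patch_size):
        if dim % patch != 0:
            raise ValueError(f"{dim} is not divisible by patch size {patch}")
    return tuple(dim // patch for dim, patch in zip(image_size, patch_size))

grid = patch_grid(image_size, patch_size)
num_patches = math.prod(grid)
voxels_per_patch = math.prod(patch_size)

print(grid)              # (8, 16, 16)
print(num_patches)       # 2048
print(voxels_per_patch)  # 1024
```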


🚀 Quick Start

import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

# Use the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained(
    "GoodBaiBai88/M3D-CLIP",
    model_max_length=512,
    padding_side="right",
    use_fast=False
)
model = AutoModel.from_pretrained(
    "GoodBaiBai88/M3D-CLIP",
    trust_remote_code=True
)
model = model.to(device=device)

# Prepare your 3D medical image:
# 1. Resize (or otherwise resample) the image to a shape of 1*32*256*256.
# 2. Normalize intensities to the range 0-1, for example with Min-Max Normalization.
# 3. Save the image as a .npy file.
# 4. Although we did not train on 2D images, in theory a 2D image can be interpolated to the shape of 1*32*256*256 for input.

image_path = ""  # path to your preprocessed .npy file
input_txt = ""   # text to encode

text_tensor = tokenizer(input_txt, max_length=512, truncation=True, padding="max_length", return_tensors="pt")
input_id = text_tensor["input_ids"].to(device=device)
attention_mask = text_tensor["attention_mask"].to(device=device)

# np.load returns a NumPy array, which has no .to() method, so convert it to a tensor first.
# unsqueeze(0) adds a batch dimension, assuming the saved array has shape 1*32*256*256.
image = torch.from_numpy(np.load(image_path)).unsqueeze(0).to(device=device, dtype=torch.float32)

with torch.inference_mode():
    # Index 0 along the sequence dimension is taken as the feature for each input.
    image_features = model.encode_image(image)[:, 0]
    text_features = model.encode_text(input_id, attention_mask)[:, 0]
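The preparation steps listed in the comments above (resize to 1*32*256*256, normalize to 0-1, save as .npy) can be sketched with NumPy alone. This is an illustration under stated assumptions, not the authors' preprocessing pipeline: it uses nearest-neighbour resampling for simplicity where an interpolating resize may be preferable, applies Min-Max normalization over the whole volume, and all function names and the output file name are hypothetical.

```python
import numpy as np

def min_max_normalize(volume):
    """Rescale intensities to the range 0-1 (constant volumes map to all zeros)."""
    volume = volume.astype(np.float32)
    lo, hi = volume.min(), volume.max()
    if hi == lo:
        return np.zeros_like(volume)
    return (volume - lo) / (hi - lo)

def resize_nearest(volume, target_shape=(32, 256, 256)):
    """Nearest-neighbour resample of a 3D array to target_shape."""
    index_grids = [
        np.clip(np.floor((np.arange(t) + 0.5) * s / t).astype(int), 0, s - 1)
        for t, s in zip(target_shape, volume.shape)
    ]
    return volume[np.ix_(*index_grids)]

def prepare_volume(volume, target_shape=(32, 256, 256)):
    """Normalize, resize, and add a leading channel axis: D*H*W -> 1*32*256*256."""
    resized = resize_nearest(min_max_normalize(volume), target_shape)
    return resized[np.newaxis, ...]

# Synthetic stand-in for a scan with an arbitrary size and intensity range.
raw = np.random.default_rng(0).integers(-1000, 2000, size=(40, 128, 128))
prepared = prepare_volume(raw)
print(prepared.shape, prepared.dtype)

# np.save("example_volume.npy", prepared)  # hypothetical file name
```

The saved file can then be used as image_path in the Quick Start code.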

📄 License

This project is licensed under the Apache-2.0 license.

📚 Documentation

Citation

If you find our work helpful, please consider citing the following work:

@misc{bai2024m3d,
      title={M3D: Advancing 3D Medical Image Analysis with Multi-Modal Large Language Models}, 
      author={Fan Bai and Yuxin Du and Tiejun Huang and Max Q.-H. Meng and Bo Zhao},
      year={2024},
      eprint={2404.00578},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

📦 Information Table

Property	Details
Model Type	3D medical CLIP model
Training Data	M3D-Cap dataset
Metrics	accuracy
Pipeline Tag	image-feature-extraction
