M3D-LaMed-Llama-2-7B Open Source Model - A Practical Choice for 3D Medical Imaging Analysis

M3D LaMed Llama 2 7B

Developed by GoodBaiBai88

M3D is a line of work on 3D medical image analysis with multimodal large language models, comprising the M3D-Data dataset, the M3D-LaMed model, and the M3D-Bench evaluation benchmark.

Image-to-Text

Transformers

Open Source License: Apache-2.0 #3D Medical Image Analysis #Multimodal Large Language Model #Medical Report Generation

Downloads 209

Release Time : 4/27/2024

Model Overview

M3D-LaMed is a versatile multimodal model equipped with the M3D-CLIP pre-trained visual encoder, supporting tasks such as image-text retrieval, report generation, visual question answering, localization, and segmentation.

Model Features

Multimodal 3D Medical Image Analysis

Processes 3D medical image volumes together with text for multimodal medical image analysis.

Multifunctional Task Support

Capable of performing various tasks such as image-text retrieval, report generation, visual question answering, localization, and segmentation.

Large-scale Pre-training Data

Trained on the M3D-Data dataset, which includes 120,000 image-text pairs and 662,000 instruction-response pairs.

Model Capabilities

3D Medical Image Analysis

Medical Report Generation

Visual Question Answering

Organ Segmentation

Bounding Box Annotation

Image-Text Retrieval
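In the Quick Start, the task is chosen by the question text appended after the image placeholder tokens (segmentation additionally passes `seg_enable=True` to `generate`). As a minimal sketch, the helper below prepends the 256 `<im_patch>` placeholders used there (`proj_out_num = 256`) to one of the three example questions from that code. The `TASK_TEMPLATES` table, the `build_input_text` helper, and substituting organ names other than "liver" are hypothetical illustrations, not part of the released model code.

```python
# Hypothetical helper: builds the text input for a task, assuming the prompt
# phrasings from the Quick Start and 256 image placeholder tokens.
PROJ_OUT_NUM = 256  # number of "<im_patch>" placeholders, as in the Quick Start
IMAGE_TOKENS = "<im_patch>" * PROJ_OUT_NUM

# Example questions copied from the Quick Start. Replacing "liver" with other
# organ names via {organ} is an untested assumption.
TASK_TEMPLATES = {
    "report": "Can you provide a caption consists of findings for this medical image?",
    "segmentation": "What is {organ} in this image? Please output the segmentation mask.",
    "box": "What is {organ} in this image? Please output the box.",
}


def build_input_text(task: str, organ: str = "liver") -> str:
    """Return the image placeholder tokens followed by the task question."""
    if task not in TASK_TEMPLATES:
        raise ValueError(f"unknown task {task!r}; expected one of {sorted(TASK_TEMPLATES)}")
    # str.format ignores the unused `organ` keyword for the report template.
    return IMAGE_TOKENS + TASK_TEMPLATES[task].format(organ=organ)


if __name__ == "__main__":
    text = build_input_text("segmentation")
    print(text[len(IMAGE_TOKENS):])  # the question part only
```

The resulting string plays the role of `input_txt` in the Quick Start and is tokenized the same way.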

Use Cases

Medical Imaging Diagnosis

Liver Region Segmentation

Identify and segment the liver region in 3D medical images.

Output segmentation mask.

Medical Report Generation

Automatically generate descriptive text of examination findings based on 3D medical images.

Generate natural language reports.

Medical Image Analysis

Organ Localization

Annotate the bounding box of a specific organ in the image.

Output bounding box coordinates.

Medical Image Question Answering

Answer professional questions about the content of 3D medical images.

Provide explanations of the image findings in natural language.
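For organ localization, the Quick Start only decodes the generated text; it does not show how the box coordinates are formatted. The sketch below assumes, hypothetically, that the answer contains six numbers normalized to [0, 1], ordered as three start coordinates followed by three end coordinates along the same axes as the 32*256*256 input volume, and converts them to clamped voxel indices. `parse_box`, `box_to_voxels`, and the assumed format are illustrations only; check real model output before relying on them.

```python
import re
from typing import List, Tuple

# Input volume shape from the Quick Start (depth, height, width), channel axis dropped.
VOLUME_SHAPE = (32, 256, 256)

# Matches integers and decimals such as "0", "0.25", "-1.5".
NUMBER_RE = re.compile(r"-?\d+(?:\.\d+)?")


def parse_box(text: str) -> List[float]:
    """Extract the first six numbers from generated text (assumed box format)."""
    values = [float(v) for v in NUMBER_RE.findall(text)]
    if len(values) < 6:
        raise ValueError(f"expected 6 coordinates, found {len(values)} in {text!r}")
    return values[:6]


def box_to_voxels(box: List[float],
                  shape: Tuple[int, int, int] = VOLUME_SHAPE) -> Tuple[int, ...]:
    """Scale assumed-normalized [start x3, end x3] coordinates to voxel indices."""
    out = []
    # Start and end coordinates share the same three axis sizes.
    for value, size in zip(box, shape + shape):
        index = round(value * (size - 1))
        out.append(min(max(index, 0), size - 1))  # clamp into the volume
    return tuple(out)


if __name__ == "__main__":
    answer = "Coordinates are [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]."  # made-up example text
    print(box_to_voxels(parse_box(answer)))
```

Clamping keeps slightly out-of-range predictions usable as array slices.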

🚀 M3D: Advancing 3D Medical Image Analysis with Multi-Modal Large Language Models

M3D is a pioneering, comprehensive series of work on multi-modal large language models for 3D medical image analysis. It offers a complete solution comprising a large-scale dataset, versatile models, and a comprehensive evaluation benchmark.


✨ Features

M3D-Data: The largest-scale open-source 3D medical dataset, consisting of 120K image-text pairs and 662K instruction-response pairs.
M3D-LaMed: Versatile multi-modal models with an M3D-CLIP pretrained vision encoder. These models can handle tasks such as image-text retrieval, report generation, visual question answering, positioning, and segmentation.
M3D-Bench: The most comprehensive automatic evaluation benchmark, covering 8 tasks.

📄 License

This project is licensed under the Apache-2.0 license.

⚠️ Important Note

We found that the previous GoodBaiBai88/M3D-LaMed-Llama-2-7B model had problems with the segmentation task. We have fixed this problem and will re-release the new model in the next few days.

🚀 Quick Start

The model can be loaded and run through Hugging Face Transformers as follows.

import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import simple_slice_viewer as ssv
import SimpleITK as sikt

device = torch.device('cuda')  # 'cpu' or 'cuda'
dtype = torch.bfloat16  # or torch.float16, torch.float32

model_name_or_path = 'GoodBaiBai88/M3D-LaMed-Llama-2-7B'
proj_out_num = 256

# Prepare your 3D medical image:
# 1. The image must have shape 1*32*256*256; resize or interpolate it as needed.
# 2. The intensities must be normalized to the range 0-1, e.g. with min-max normalization.
# 3. The image must be saved in NumPy .npy format.
# 4. Although we did not train on 2D images, in theory a 2D image can be interpolated to the shape 1*32*256*256 and used as input.
image_path = "./Data/data/examples/example_01.npy"

model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    torch_dtype=dtype,
    device_map='auto',
    trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(
    model_name_or_path,
    model_max_length=512,
    padding_side="right",
    use_fast=False,
    trust_remote_code=True
)

model = model.to(device=device)

# question = "Can you provide a caption consists of findings for this medical image?"
question = "What is liver in this image? Please output the segmentation mask."
# question = "What is liver in this image? Please output the box."

image_tokens = "<im_patch>" * proj_out_num
input_txt = image_tokens + question
input_id = tokenizer(input_txt, return_tensors="pt")['input_ids'].to(device=device)

image_np = np.load(image_path)
image_pt = torch.from_numpy(image_np).unsqueeze(0).to(dtype=dtype, device=device)

# generation = model.generate(image_pt, input_id, max_new_tokens=256, do_sample=True, top_p=0.9, temperature=1.0)
generation, seg_logit = model.generate(image_pt, input_id, seg_enable=True, max_new_tokens=256, do_sample=True, top_p=0.9, temperature=1.0)

generated_texts = tokenizer.batch_decode(generation, skip_special_tokens=True)
seg_mask = (torch.sigmoid(seg_logit) > 0.5) * 1.0

print('question', question)
print('generated_texts', generated_texts[0])

image = sikt.GetImageFromArray(image_np)
ssv.display(image)
seg = sikt.GetImageFromArray(seg_mask.cpu().numpy()[0])
ssv.display(seg)
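The preparation steps listed in the code comments above (shape 1*32*256*256, values in 0-1, saved as .npy, and optionally a 2D image stretched to 32 slices) can be sketched with NumPy alone. This version uses nearest-neighbour resampling for simplicity; it is a stand-in, and smoother resampling such as trilinear interpolation (for example `torch.nn.functional.interpolate` with `mode="trilinear"`) may suit real scans better. The function names and the `prepared_example.npy` file name are hypothetical.

```python
import numpy as np

# Target (depth, height, width); a leading channel axis of size 1 is added at the end.
TARGET_SHAPE = (32, 256, 256)


def resize_nearest(volume: np.ndarray, target=TARGET_SHAPE) -> np.ndarray:
    """Nearest-neighbour resize of a 3D array by sampling the closest source index."""
    idx = [
        np.minimum(np.floor((np.arange(t) + 0.5) * s / t).astype(int), s - 1)
        for s, t in zip(volume.shape, target)
    ]
    return volume[np.ix_(*idx)]


def min_max_normalize(volume: np.ndarray) -> np.ndarray:
    """Scale intensities to [0, 1]; a constant volume becomes all zeros."""
    volume = volume.astype(np.float32)
    vmin, vmax = float(volume.min()), float(volume.max())
    if vmax > vmin:
        return (volume - vmin) / (vmax - vmin)
    return np.zeros_like(volume)


def prepare_volume(volume: np.ndarray) -> np.ndarray:
    """Return a float32 array of shape (1, 32, 256, 256) with values in [0, 1]."""
    if volume.ndim == 2:
        volume = volume[np.newaxis]  # treat a 2D image as one slice, repeated to 32
    if volume.ndim != 3:
        raise ValueError(f"expected a 2D or 3D array, got shape {volume.shape}")
    return min_max_normalize(resize_nearest(volume))[np.newaxis]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    scan = rng.normal(size=(48, 320, 320)).astype(np.float32)  # stand-in for a real scan
    prepared = prepare_volume(scan)
    np.save("prepared_example.npy", prepared)  # hypothetical output path
    print(prepared.shape, prepared.dtype)
```

The saved file can then be used as `image_path` in the Quick Start, which adds the batch axis with `unsqueeze(0)`.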
