M3D-CLIP: an open-source 3D medical imaging model that aligns vision and language to support diagnosis


M3D-CLIP

Developed by GoodBaiBai88

M3D-CLIP is a CLIP model designed specifically for 3D medical images. It aligns vision and language with a contrastive loss.

Multimodal alignment

Transformers

Open-source license: Apache-2.0 #3D Medical CLIP #Cross-modal Retrieval #Medical Image Analysis

Downloads: 2,962

Release date: April 25, 2024

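Model overview

M3D-CLIP is a vision-language model built on a 3D ViT architecture. It specializes in cross-modal retrieval between 3D medical images and text, and in feature extraction.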

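Model features

Built for 3D medical images

Designed specifically for 3D medical images; uses a 3D ViT architecture to process 3D images of size 32*256*256.

Cross-modal alignment

Aligns 3D medical images and text semantically through a contrastive loss.

Strong feature representations

Provides aligned image and text features for downstream tasks.

Pretraining benefits

The text-aligned vision encoder can serve as a pretrained model for vision and multimodal tasks.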
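
The contrastive alignment mentioned in the features above can be illustrated with a generic CLIP-style symmetric loss. This is a minimal NumPy sketch, not the M3D-CLIP training code: the function names, the temperature value of 0.07, and the assumption that row `i` of each batch forms a matched image-text pair are all illustrative choices that the source does not specify.

```python
import numpy as np


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired features.

    image_features, text_features: arrays of shape (B, D), where row i of
    each array is assumed to belong to the same image-text pair.
    The temperature is a hypothetical default, not a value from the source.
    """
    img = image_features / np.linalg.norm(image_features, axis=1, keepdims=True)
    txt = text_features / np.linalg.norm(text_features, axis=1, keepdims=True)

    # Cosine similarity of every image with every text, scaled by temperature.
    logits = img @ txt.T / temperature  # shape (B, B)
    diag = np.arange(logits.shape[0])

    # Image-to-text: each row should put its probability mass on the diagonal.
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    # Text-to-image: the same requirement, column-wise.
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_i2t + loss_t2i)


# Toy check: correctly paired features give a lower loss than shuffled pairs.
features = np.eye(4)
matched_loss = clip_contrastive_loss(features, features)
shuffled_loss = clip_contrastive_loss(features, features[[1, 2, 3, 0]])
print(matched_loss < shuffled_loss)  # True
```

Minimizing a loss of this kind pulls matched image and text embeddings together and pushes mismatched ones apart, which is what makes the resulting features usable for retrieval.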

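Model capabilities

3D medical image feature extraction

Cross-modal retrieval between medical images and text

Semantic understanding of medical images

Multimodal representation learning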

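Use cases

Medical image analysis

Medical image retrieval

Retrieve relevant 3D medical images from a text description.

Efficient and accurate cross-modal retrieval.

Medical report generation

Generate descriptive text for 3D medical images.

Medical image classification

Classify images using the aligned features.

Medical research

Medical knowledge mining

Discover relevant knowledge in large-scale medical image and text data.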

🚀 M3D-CLIP

M3D-CLIP is part of the [M3D](https://github.com/BAAI-DCAI/M3D) series of work. It is a 3D medical CLIP model that aligns vision and language with a contrastive loss on the [M3D-Cap](https://huggingface.co/datasets/GoodBaiBai88/M3D-Cap) dataset. The vision encoder is a 3D ViT with an image size of 32*256*256 and a patch size of 4*16*16. The language encoder is initialized from a pretrained BERT.
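
To make the encoder sizes concrete, the sketch below computes how many non-overlapping patches a 32*256*256 volume yields with a 4*16*16 patch. The helper `patch_grid` is a hypothetical name used only for this illustration. The count covers image patches only; whether the encoder adds extra tokens (for example a class token) is not stated here.

```python
from math import prod


def patch_grid(image_size, patch_size):
    """Return the number of non-overlapping patches along each axis."""
    if len(image_size) != len(patch_size):
        raise ValueError("image_size and patch_size must have the same rank")
    for dim, patch in zip(image_size, patch_size):
        if dim % patch != 0:
            raise ValueError(f"axis of size {dim} is not divisible by patch size {patch}")
    return tuple(dim // patch for dim, patch in zip(image_size, patch_size))


grid = patch_grid((32, 256, 256), (4, 16, 16))
num_patches = prod(grid)
voxels_per_patch = prod((4, 16, 16))
print(grid, num_patches, voxels_per_patch)  # (8, 16, 16) 2048 1024
```

Each volume is therefore split into 8 * 16 * 16 = 2048 patches of 1024 voxels each.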

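The model can be used for the following:

Retrieval tasks between 3D medical images and text.
Aligned, strong image and text features for downstream tasks.
A text-aligned vision encoder that serves as a pretrained model for vision and multimodal tasks.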
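
The retrieval use case listed above amounts to ranking image features by their similarity to a text feature. Below is a minimal sketch with toy 2-D vectors standing in for real model outputs. The function name `rank_images` is hypothetical, and cosine similarity is an assumed (though common) scoring choice for CLIP-style features.

```python
import numpy as np


def rank_images(text_feature, image_features):
    """Rank images by cosine similarity to a single text feature.

    text_feature: shape (D,); image_features: shape (N, D).
    Returns (indices sorted best-first, similarity scores in input order).
    """
    text = np.asarray(text_feature, dtype=np.float64)
    images = np.asarray(image_features, dtype=np.float64)

    # L2-normalize so the dot product equals cosine similarity.
    text = text / np.linalg.norm(text)
    images = images / np.linalg.norm(images, axis=1, keepdims=True)

    scores = images @ text
    order = np.argsort(-scores)  # descending similarity
    return order, scores


# Toy vectors in place of real encoder outputs.
text = [1.0, 0.0]
images = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]
order, scores = rank_images(text, images)
print(order.tolist())  # [1, 2, 0]
```

With real data, `text_feature` and `image_features` would be the encoder outputs moved to the CPU and converted to NumPy arrays.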


🚀 Quick Start

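import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

# Use the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")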

tokenizer = AutoTokenizer.from_pretrained(
    "GoodBaiBai88/M3D-CLIP",
    model_max_length=512,
    padding_side="right",
    use_fast=False
)
model = AutoModel.from_pretrained(
    "GoodBaiBai88/M3D-CLIP",
    trust_remote_code=True
)
model = model.to(device=device)

# Prepare your 3D medical image:
# 1. Resize (or otherwise resample) the image to shape 1*32*256*256.
# 2. Normalize intensities to the range 0-1, e.g. with min-max normalization.
# 3. Save the image in .npy format.
# 4. The model was not trained on 2D images, but in theory a 2D image can be
#    interpolated to shape 1*32*256*256 and used as input.
    
image_path = ""
input_txt = ""

text_tensor = tokenizer(input_txt, max_length=512, truncation=True, padding="max_length", return_tensors="pt")
input_id = text_tensor["input_ids"].to(device=device)
attention_mask = text_tensor["attention_mask"].to(device=device)
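# np.load returns a NumPy array, which has no .to(); convert it to a tensor first.
# Assumption: the saved array has shape 1*32*256*256 and the model expects a leading batch dimension.
image = torch.from_numpy(np.load(image_path)).unsqueeze(0).to(dtype=torch.float32, device=device)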

with torch.inference_mode():
    image_features = model.encode_image(image)[:, 0]
    text_features = model.encode_text(input_id, attention_mask)[:, 0]
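
The preparation steps listed in the comments above (resize to 1*32*256*256, scale to 0-1, save as .npy) can be sketched as follows. This is an illustration under stated assumptions, not the authors' preprocessing pipeline. The helper names are hypothetical, the input is a random array standing in for a real scan, and nearest-neighbour resampling is used only to keep the example NumPy-only; an interpolating resize may be preferable in practice.

```python
import numpy as np

TARGET_SHAPE = (32, 256, 256)  # depth, height, width expected by the model


def min_max_normalize(volume):
    """Scale intensities to [0, 1]; a constant volume maps to all zeros."""
    volume = volume.astype(np.float32)
    lo, hi = volume.min(), volume.max()
    if hi == lo:
        return np.zeros_like(volume)
    return (volume - lo) / (hi - lo)


def resize_nearest(volume, target_shape=TARGET_SHAPE):
    """Nearest-neighbour resampling of a 3D volume to target_shape."""
    index_grids = [
        np.round(np.linspace(0, size - 1, target)).astype(int)
        for size, target in zip(volume.shape, target_shape)
    ]
    return volume[np.ix_(*index_grids)]


def prepare_volume(volume):
    """Normalize, resize, and add the leading channel axis: (1, 32, 256, 256)."""
    resized = resize_nearest(min_max_normalize(volume))
    return resized[np.newaxis, ...]


# Random stand-in for a scan of arbitrary size.
raw = np.random.default_rng(0).integers(-1000, 1000, size=(40, 128, 128))
prepared = prepare_volume(raw)
print(prepared.shape, prepared.dtype)  # (1, 32, 256, 256) float32
# np.save("example_volume.npy", prepared)  # hypothetical output path
```

The saved file can then be used as `image_path` in the quick-start code.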

📄 License

This project is released under the Apache-2.0 license.

📚 Citation

If you find this work useful, please cite the following paper.

@misc{bai2024m3d,
      title={M3D: Advancing 3D Medical Image Analysis with Multi-Modal Large Language Models}, 
      author={Fan Bai and Yuxin Du and Tiejun Huang and Max Q.-H. Meng and Bo Zhao},
      year={2024},
      eprint={2404.00578},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}