# ConceptCLIP
ConceptCLIP is a large-scale vision-language pre-training model enhanced with medical concepts. It can handle diverse medical image modalities and achieve robust performance across multiple medical imaging tasks through concept-enhanced language-image alignment.
## Quick Start
```python
from transformers import AutoModel, AutoProcessor
import torch
from PIL import Image

# Load the model and processor (trust_remote_code is required for the custom model code)
model = AutoModel.from_pretrained('JerrryNie/ConceptCLIP', trust_remote_code=True)
processor = AutoProcessor.from_pretrained('JerrryNie/ConceptCLIP', trust_remote_code=True)

# One example image and a set of candidate labels turned into text prompts
image = Image.open('example_data/chest_X-ray.jpg').convert('RGB')
labels = ['chest X-ray', 'brain MRI', 'skin lesion']
texts = [f'a medical image of {label}' for label in labels]

inputs = processor(
    images=image,
    text=texts,
    return_tensors='pt',
    padding=True,
    truncation=True
).to(model.device)

# Zero-shot classification: softmax over scaled image-text similarities
with torch.no_grad():
    outputs = model(**inputs)
logits = (outputs['logit_scale'] * outputs['image_features'] @ outputs['text_features'].t()).softmax(dim=-1)[0]
print({label: f'{prob:.2%}' for label, prob in zip(labels, logits)})
```
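The dictionary printed at the end maps each candidate label to a probability; for the bundled chest X-ray example, the `chest X-ray` prompt should receive the highest score.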
## Features
- Diverse Medical Tasks: enables zero-shot medical image classification, cross-modal retrieval (a retrieval sketch follows this list), zero-shot concept annotation, and feature extraction for whole-slide image analysis and medical report generation.
- Downstream Adaptability: can be fine-tuned for specific medical imaging tasks, used as a concept bottleneck model for explanation, integrated into clinical decision support systems, and applied in medical education and training tools.
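As a complement to the zero-shot classification example above, the snippet below sketches cross-modal (image-to-text) retrieval. It reuses `model` and `processor` from the Quick Start and the same output keys; the candidate report texts are hypothetical, and the features are assumed to be L2-normalized, as in standard CLIP-style models.

```python
# Minimal image-to-text retrieval sketch, reusing `model`/`processor` from the
# Quick Start; the candidate report texts below are hypothetical examples.
import torch
from PIL import Image

reports = [
    'no acute cardiopulmonary abnormality',
    'right lower lobe consolidation consistent with pneumonia',
    'melanocytic lesion with irregular borders',
]
image = Image.open('example_data/chest_X-ray.jpg').convert('RGB')

inputs = processor(images=image, text=reports, return_tensors='pt',
                   padding=True, truncation=True).to(model.device)
with torch.no_grad():
    out = model(**inputs)

# Rank candidate texts by similarity to the image embedding (assumed normalized)
scores = (out['image_features'] @ out['text_features'].t())[0]
for idx in scores.argsort(descending=True).tolist():
    print(f'{scores[idx]:.3f}  {reports[idx]}')
```

Text-to-image retrieval works the same way, ranking a batch of image embeddings against a single text embedding.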
## Installation
The example code uses the `transformers` library, plus `torch` and Pillow for tensors and image loading. Install them with:

```bash
pip install transformers torch pillow
```
## Documentation
### Model Details
#### Model Description
- Developed by: Yuxiang Nie, Sunan He, Yequan Bie, Yihui Wang, Zhixuan Chen, Shu Yang, Zhiyuan Cai, Hongmei Wang, Xi Wang, Luyang Luo, Mingxiang Wu, Xian Wu, Ronald Cheong Kin Chan, Yuk Ming Lau, Yefeng Zheng, Pranav Rajpurkar, Hao Chen
- Model type: Vision-Language Pre-trained Model (Medical Specialized)
- Language(s): English (text), Multi-modal (medical imaging)
- License: MIT
- Finetuned from model: Based on OpenCLIP
#### Model Sources
- Paper: [arXiv:2501.15579](https://arxiv.org/abs/2501.15579)
### Uses
#### Direct Use
- Zero-shot medical image classification
- Cross-modal retrieval
- Zero-shot concept annotation
- Feature extraction for whole-slide image analysis (see the sketch after this list)
- Feature extraction for medical report generation
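The feature-extraction bullets can be served with the plain forward pass from the Quick Start. The sketch below extracts embeddings for a batch of whole-slide-image patches; the patch paths are hypothetical, and the placeholder text prompt is an assumption made because the processor shown above is called with both images and text.

```python
# Sketch: image embeddings for (hypothetical) WSI patches via the Quick Start
# forward pass. The placeholder prompt is an assumption; this card does not
# document a dedicated image-only encoding method.
import torch
from PIL import Image

patch_paths = ['patches/patch_000.png', 'patches/patch_001.png']  # hypothetical
patches = [Image.open(p).convert('RGB') for p in patch_paths]

inputs = processor(images=patches, text=['a pathology image patch'],
                   return_tensors='pt', padding=True, truncation=True).to(model.device)
with torch.no_grad():
    feats = model(**inputs)['image_features']  # (num_patches, embed_dim)

print(feats.shape)  # feed these into a downstream WSI aggregator or report model
```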
#### Downstream Use
- Fine-tuning for specific medical imaging tasks (CT, MRI, X-ray analysis), e.g. classification and visual question answering (a linear-probe sketch follows this list)
- Concept bottleneck model for explanation
- Integration into clinical decision support systems
- Medical education and training tools
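One common adaptation recipe is a linear probe on frozen ConceptCLIP image features. The sketch below assumes features have already been extracted as in the snippets above; the feature dimension, dataset tensors, and class count are hypothetical placeholders rather than values from this card.

```python
# Linear-probe sketch on frozen image features; all sizes and data are
# hypothetical placeholders.
import torch
import torch.nn as nn

embed_dim, num_classes = 768, 3             # assumption: match the real feature size
features = torch.randn(256, embed_dim)      # stand-in for precomputed image_features
labels = torch.randint(0, num_classes, (256,))

probe = nn.Linear(embed_dim, num_classes)
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(10):                     # tiny toy training loop
    optimizer.zero_grad()
    loss = loss_fn(probe(features), labels)
    loss.backward()
    optimizer.step()
print(f'final training loss: {loss.item():.4f}')
```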
#### Out-of-Scope Use
- Direct clinical diagnosis without clinical validation
- Non - medical image analysis
- General-purpose vision tasks outside the medical domain
### Bias, Risks, and Limitations
- Trained primarily on medical imaging data, which may contain demographic biases.
- Performance may vary across different medical imaging modalities.
- Should not be used as a sole diagnostic tool without human oversight.
#### Recommendations
- Validate outputs with clinical experts before medical decision making.
- Fine-tune on domain-specific data for specialized applications.
- Conduct bias analysis when deploying in new clinical environments.
### Training Details
#### Training Data
- Large-scale medical image-text pairs with concept information
#### Training Procedure
- Built on the OpenCLIP architecture with medical concept integration.
- Pre-training with image-text alignment (IT-Align) and region-concept alignment (RC-Align) objectives.
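For intuition, the sketch below shows a generic CLIP-style symmetric contrastive loss of the kind image-text alignment objectives build on. It is an illustration only, assuming L2-normalized embeddings with matched pairs on the diagonal, not the paper's exact IT-Align or RC-Align formulation.

```python
# Generic symmetric image-text contrastive (InfoNCE) loss; an illustration of
# the alignment idea, not ConceptCLIP's exact objective.
import torch
import torch.nn.functional as F

def clip_style_loss(image_features, text_features, logit_scale):
    # Assumes L2-normalized features; matched image-text pairs share an index
    logits = logit_scale * image_features @ text_features.t()
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Toy usage with random normalized embeddings
img = F.normalize(torch.randn(8, 512), dim=-1)
txt = F.normalize(torch.randn(8, 512), dim=-1)
print(clip_style_loss(img, txt, logit_scale=100.0))
```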
#### Training Hyperparameters

| Property | Details |
| --- | --- |
| Base architecture | SigLIP-ViT-400M-16 + PubMedBERT |
| Training regime | Mixed-precision training |
| Batch size | 12,288 w/o PC-Align; 6,144 w/ PC-Align |
| Learning rate | 5e-4 w/o PC-Align; 3e-4 w/ PC-Align |
### Evaluation
#### Testing Data & Metrics
##### Testing Data
- Evaluated on multiple open-sourced medical imaging benchmarks, including medical image diagnosis, cross-modal retrieval, medical visual question answering, medical report generation, whole-slide image analysis, and explainable AI.
### Citation

BibTeX:

```bibtex
@article{nie2025conceptclip,
  title={An Explainable Biomedical Foundation Model via Large-Scale Concept-Enhanced Vision-Language Pre-training},
  author={Nie, Yuxiang and He, Sunan and Bie, Yequan and Wang, Yihui and Chen, Zhixuan and Yang, Shu and Cai, Zhiyuan and Wang, Hongmei and Wang, Xi and Luo, Luyang and Wu, Mingxiang and Wu, Xian and Chan, Ronald Cheong Kin and Lau, Yuk Ming and Zheng, Yefeng and Rajpurkar, Pranav and Chen, Hao},
  journal={arXiv preprint arXiv:2501.15579},
  year={2025}
}
```
APA:
Nie, Y., He, S., Bie, Y., Wang, Y., Chen, Z., Yang, S., Cai, Z., Wang, H., Wang, X., Luo, L., Wu, M., Wu, X., Chan, R. C. K., Lau, Y. M., Zheng, Y., Rajpurkar, P., & Chen, H. (2025). An Explainable Biomedical Foundation Model via Large-Scale Concept-Enhanced Vision-Language Pre-training. arXiv preprint arXiv:2501.15579.
### Model Card Contact
Yuxiang Nie: ynieae@connect.ust.hk
## License
This model is licensed under the MIT license.