# ConceptCLIP
ConceptCLIP is a large-scale vision-language pre-training model enhanced with medical concepts. It can handle diverse medical image modalities and achieve robust performance across multiple medical imaging tasks through concept-enhanced language-image alignment.
## Quick Start
```python
from transformers import AutoModel, AutoProcessor
import torch
from PIL import Image

# Load the model and processor (trust_remote_code is required for the custom model code)
model = AutoModel.from_pretrained('JerrryNie/ConceptCLIP', trust_remote_code=True)
processor = AutoProcessor.from_pretrained('JerrryNie/ConceptCLIP', trust_remote_code=True)

# One example image and a set of candidate labels turned into text prompts
image = Image.open('example_data/chest_X-ray.jpg').convert('RGB')
labels = ['chest X-ray', 'brain MRI', 'skin lesion']
texts = [f'a medical image of {label}' for label in labels]

inputs = processor(
    images=image,
    text=texts,
    return_tensors='pt',
    padding=True,
    truncation=True
).to(model.device)

# Zero-shot classification: softmax over scaled image-text similarities
with torch.no_grad():
    outputs = model(**inputs)
logits = (outputs['logit_scale'] * outputs['image_features'] @ outputs['text_features'].t()).softmax(dim=-1)[0]
print({label: f'{prob:.2%}' for label, prob in zip(labels, logits)})
```
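The dictionary printed at the end maps each candidate label to a probability; for the bundled chest X-ray example, the `chest X-ray` prompt should receive the highest score.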
## Features
- Diverse Medical Tasks: enables zero-shot medical image classification, cross-modal retrieval (a retrieval sketch follows this list), zero-shot concept annotation, and feature extraction for whole-slide image analysis and medical report generation.
- Downstream Adaptability: can be fine-tuned for specific medical imaging tasks, used as a concept bottleneck model for explanation, integrated into clinical decision support systems, and applied in medical education and training tools.
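As a complement to the zero-shot classification example above, the snippet below sketches cross-modal (image-to-text) retrieval. It reuses `model` and `processor` from the Quick Start and the same output keys; the candidate report texts are hypothetical, and the features are assumed to be L2-normalized, as in standard CLIP-style models.

```python
# Minimal image-to-text retrieval sketch, reusing `model`/`processor` from the
# Quick Start; the candidate report texts below are hypothetical examples.
import torch
from PIL import Image

reports = [
    'no acute cardiopulmonary abnormality',
    'right lower lobe consolidation consistent with pneumonia',
    'melanocytic lesion with irregular borders',
]
image = Image.open('example_data/chest_X-ray.jpg').convert('RGB')

inputs = processor(images=image, text=reports, return_tensors='pt',
                   padding=True, truncation=True).to(model.device)
with torch.no_grad():
    out = model(**inputs)

# Rank candidate texts by similarity to the image embedding (assumed normalized)
scores = (out['image_features'] @ out['text_features'].t())[0]
for idx in scores.argsort(descending=True).tolist():
    print(f'{scores[idx]:.3f}  {reports[idx]}')
```

Text-to-image retrieval works the same way, ranking a batch of image embeddings against a single text embedding.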
## Installation
The example code uses the `transformers` library, plus `torch` and Pillow for tensors and image loading. Install them with:

```bash
pip install transformers torch pillow
```
## Documentation
### Model Details
#### Model Description
- Developed by: Yuxiang Nie, Sunan He, Yequan Bie, Yihui Wang, Zhixuan Chen, Shu Yang, Zhiyuan Cai, Hongmei Wang, Xi Wang, Luyang Luo, Mingxiang Wu, Xian Wu, Ronald Cheong Kin Chan, Yuk Ming Lau, Yefeng Zheng, Pranav Rajpurkar, Hao Chen
- Model type: Vision-Language Pre-trained Model (Medical Specialized)
- Language(s): English (text), Multi-modal (medical imaging)
- License: MIT
- Finetuned from model: Based on OpenCLIP
#### Model Sources
- Paper: [arXiv:2501.15579](https://arxiv.org/abs/2501.15579)
### Uses
#### Direct Use
- Zero-shot medical image classification
- Cross-modal retrieval
- Zero-shot concept annotation
- Feature extraction for whole-slide image analysis (see the sketch after this list)
- Feature extraction for medical report generation
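The feature-extraction bullets can be served with the plain forward pass from the Quick Start. The sketch below extracts embeddings for a batch of whole-slide-image patches; the patch paths are hypothetical, and the placeholder text prompt is an assumption made because the processor shown above is called with both images and text.

```python
# Sketch: image embeddings for (hypothetical) WSI patches via the Quick Start
# forward pass. The placeholder prompt is an assumption; this card does not
# document a dedicated image-only encoding method.
import torch
from PIL import Image

patch_paths = ['patches/patch_000.png', 'patches/patch_001.png']  # hypothetical
patches = [Image.open(p).convert('RGB') for p in patch_paths]

inputs = processor(images=patches, text=['a pathology image patch'],
                   return_tensors='pt', padding=True, truncation=True).to(model.device)
with torch.no_grad():
    feats = model(**inputs)['image_features']  # (num_patches, embed_dim)

print(feats.shape)  # feed these into a downstream WSI aggregator or report model
```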
#### Downstream Use
- Fine-tuning for specific medical imaging tasks (CT, MRI, X-ray analysis), e.g. classification and visual question answering (a linear-probe sketch follows this list)
- Concept bottleneck model for explanation
- Integration into clinical decision support systems
- Medical education and training tools
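One common adaptation recipe is a linear probe on frozen ConceptCLIP image features. The sketch below assumes features have already been extracted as in the snippets above; the feature dimension, dataset tensors, and class count are hypothetical placeholders rather than values from this card.

```python
# Linear-probe sketch on frozen image features; all sizes and data are
# hypothetical placeholders.
import torch
import torch.nn as nn

embed_dim, num_classes = 768, 3             # assumption: match the real feature size
features = torch.randn(256, embed_dim)      # stand-in for precomputed image_features
labels = torch.randint(0, num_classes, (256,))

probe = nn.Linear(embed_dim, num_classes)
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(10):                     # tiny toy training loop
    optimizer.zero_grad()
    loss = loss_fn(probe(features), labels)
    loss.backward()
    optimizer.step()
print(f'final training loss: {loss.item():.4f}')
```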
#### Out-of-Scope Use
- Direct clinical diagnosis without clinical validation
- Non - medical image analysis
- General-purpose vision tasks outside the medical domain
### Bias, Risks, and Limitations
- Trained primarily on medical imaging data, which may contain demographic biases.
- Performance may vary across different medical imaging modalities.
- Should not be used as a sole diagnostic tool without human oversight.
#### Recommendations
- Validate outputs with clinical experts before medical decision making.
- Fine-tune on domain-specific data for specialized applications.
- Conduct bias analysis when deploying in new clinical environments.
### Training Details
#### Training Data
- Large-scale medical image-text pairs with concept information
#### Training Procedure
- Built on the OpenCLIP architecture with medical concept integration.
- Pre-training with image-text alignment (IT-Align) and region-concept alignment (RC-Align) objectives.
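For intuition, the sketch below shows a generic CLIP-style symmetric contrastive loss of the kind image-text alignment objectives build on. It is an illustration only, assuming L2-normalized embeddings with matched pairs on the diagonal, not the paper's exact IT-Align or RC-Align formulation.

```python
# Generic symmetric image-text contrastive (InfoNCE) loss; an illustration of
# the alignment idea, not ConceptCLIP's exact objective.
import torch
import torch.nn.functional as F

def clip_style_loss(image_features, text_features, logit_scale):
    # Assumes L2-normalized features; matched image-text pairs share an index
    logits = logit_scale * image_features @ text_features.t()
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Toy usage with random normalized embeddings
img = F.normalize(torch.randn(8, 512), dim=-1)
txt = F.normalize(torch.randn(8, 512), dim=-1)
print(clip_style_loss(img, txt, logit_scale=100.0))
```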
#### Training Hyperparameters

| Property | Details |
| --- | --- |
| Base architecture | SigLIP-ViT-400M-16 + PubMedBERT |
| Training regime | Mixed-precision training |
| Batch size | 12,288 w/o PC-Align; 6,144 w/ PC-Align |
| Learning rate | 5e-4 w/o PC-Align; 3e-4 w/ PC-Align |
### Evaluation
#### Testing Data & Metrics
##### Testing Data
- Evaluated on multiple open-sourced medical imaging benchmarks, including medical image diagnosis, cross-modal retrieval, medical visual question answering, medical report generation, whole-slide image analysis, and explainable AI.
### Citation

BibTeX:

```bibtex
@article{nie2025conceptclip,
  title={An Explainable Biomedical Foundation Model via Large-Scale Concept-Enhanced Vision-Language Pre-training},
  author={Nie, Yuxiang and He, Sunan and Bie, Yequan and Wang, Yihui and Chen, Zhixuan and Yang, Shu and Cai, Zhiyuan and Wang, Hongmei and Wang, Xi and Luo, Luyang and Wu, Mingxiang and Wu, Xian and Chan, Ronald Cheong Kin and Lau, Yuk Ming and Zheng, Yefeng and Rajpurkar, Pranav and Chen, Hao},
  journal={arXiv preprint arXiv:2501.15579},
  year={2025}
}
```
APA:
Nie, Y., He, S., Bie, Y., Wang, Y., Chen, Z., Yang, S., Cai, Z., Wang, H., Wang, X., Luo, L., Wu, M., Wu, X., Chan, R. C. K., Lau, Y. M., Zheng, Y., Rajpurkar, P., & Chen, H. (2025). An Explainable Biomedical Foundation Model via Large-Scale Concept-Enhanced Vision-Language Pre-training. arXiv preprint arXiv:2501.15579.
### Model Card Contact
Yuxiang Nie: ynieae@connect.ust.hk
## License
This model is licensed under the MIT license.