Model Card for PubMedCLIP
PubMedCLIP is a fine-tuned version of CLIP tailored for the medical domain, offering enhanced performance in medical image-text understanding.
🚀 Quick Start
PubMedCLIP is a fine-tuned version of CLIP for the medical domain.
✨ Features
- Fine-tuned for the medical domain, leveraging the power of CLIP in a specialized area.
- Trained on ROCO, a large-scale multimodal medical imaging dataset.
📦 Installation
The original card provides no dedicated installation steps; the model loads directly through the Hugging Face transformers library, as shown in the usage example.
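Since the usage example relies on transformers, PyTorch, Pillow, matplotlib, and requests, the dependencies can be installed with pip. The package names below are the usual PyPI ones, not taken from the original card:

```shell
pip install transformers torch pillow matplotlib requests
```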
💻 Usage Examples
Basic Usage
import requests
from PIL import Image
import matplotlib.pyplot as plt
from transformers import CLIPProcessor, CLIPModel

# Load the PubMedCLIP checkpoint (ViT-B/32 image encoder) and its processor.
model = CLIPModel.from_pretrained("flaviagiammarino/pubmed-clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("flaviagiammarino/pubmed-clip-vit-base-patch32")

# Sample image and candidate labels for zero-shot classification.
url = "https://huggingface.co/flaviagiammarino/pubmed-clip-vit-base-patch32/resolve/main/scripts/input.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
text = ["Chest X-Ray", "Brain MRI", "Abdominal CT Scan"]

# Score the image against each label and convert the logits to probabilities.
inputs = processor(text=text, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=1).squeeze()

# Display the image with the per-label probabilities as the title.
plt.subplots()
plt.imshow(image)
plt.title("\n".join(f"{label}: {prob:.4%}" for label, prob in zip(text, probs)))
plt.axis("off")
plt.tight_layout()
plt.show()
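Beyond zero-shot classification, the same checkpoint can embed images and text into CLIP's shared vector space, e.g. for retrieval. This is a minimal sketch using the standard transformers `get_image_features`/`get_text_features` API, not part of the original card:

```python
import torch
import requests
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("flaviagiammarino/pubmed-clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("flaviagiammarino/pubmed-clip-vit-base-patch32")

url = "https://huggingface.co/flaviagiammarino/pubmed-clip-vit-base-patch32/resolve/main/scripts/input.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
captions = ["Chest X-Ray", "Brain MRI", "Abdominal CT Scan"]

with torch.no_grad():
    # Project the image and each caption into the shared CLIP embedding space.
    image_emb = model.get_image_features(**processor(images=image, return_tensors="pt"))
    text_emb = model.get_text_features(**processor(text=captions, return_tensors="pt", padding=True))

# Cosine similarity: L2-normalize the embeddings, then take dot products.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
similarity = image_emb @ text_emb.T  # shape (1, 3)
best = captions[similarity.argmax().item()]
print(best)
```

The highest-similarity caption matches the zero-shot prediction from the example above, since the softmax there is a monotonic function of these same similarities.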

📚 Documentation
Model Description
PubMedCLIP was trained on the Radiology Objects in COntext (ROCO) dataset, a large-scale multimodal medical imaging dataset.
The ROCO dataset includes diverse imaging modalities (such as X-Ray, MRI, ultrasound, and fluoroscopy) from various human body regions (such as head, spine, chest, and abdomen),
captured from open-access PubMed articles.
PubMedCLIP was trained for 50 epochs with a batch size of 64 using the Adam optimizer with a learning rate of 10⁻⁵.
The authors have released three different pre-trained models at this [link](https://1drv.ms/u/s!ApXgPqe9kykTgwD4Np3-f7ODAot8?e=zLVlJ2)
which use ResNet-50, ResNet-50x4 and ViT32 as image encoders. This repository includes only the ViT32 variant of the PubMedCLIP model.
🔧 Technical Details
PubMedCLIP was trained on the ROCO dataset, a large-scale multimodal medical imaging dataset. The training was carried out for 50 epochs with a batch size of 64, using the Adam optimizer with a learning rate of 10⁻⁵. Three different pre-trained models were released, using ResNet-50, ResNet-50x4 and ViT32 as image encoders; this repository only contains the ViT32 variant.
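The training recipe described above (contrastive fine-tuning of CLIP on image-caption pairs) can be sketched with the transformers API. This is a hedged sketch, not the authors' training code: ROCO data loading is omitted, and the base checkpoint name is the standard OpenAI ViT-B/32 release:

```python
import torch
from transformers import CLIPProcessor, CLIPModel

# Start from the base CLIP ViT-B/32 checkpoint and fine-tune it.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
# Hyperparameters from the card: Adam with learning rate 1e-5.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

def train_step(images, captions):
    """One contrastive update on a batch of (image, caption) pairs."""
    inputs = processor(text=captions, images=images, return_tensors="pt",
                       padding=True, truncation=True)
    # return_loss=True makes CLIPModel compute the symmetric contrastive
    # (InfoNCE) loss over the in-batch image-text similarity matrix.
    outputs = model(**inputs, return_loss=True)
    optimizer.zero_grad()
    outputs.loss.backward()
    optimizer.step()
    return outputs.loss.item()

# The card reports 50 epochs with batch size 64 over ROCO (loader not shown).
```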
📄 License
The authors have released the model code and pre-trained checkpoints under the MIT License.
Citation Information
@article{eslami2021does,
  title={Does CLIP benefit visual question answering in the medical domain as much as it does in the general domain?},
  author={Eslami, Sedigheh and de Melo, Gerard and Meinel, Christoph},
  journal={arXiv preprint arXiv:2112.13906},
  year={2021}
}