Model Card: clip-rsicd
This model is a fine-tuned version of CLIP by OpenAI. It aims to enhance zero-shot image classification, text-to-image, and image-to-image retrieval, specifically for remote sensing images.
Quick Start
You can start using the clip-rsicd model with the following code example:
```python
from PIL import Image
import requests
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("flax-community/clip-rsicd")
processor = CLIPProcessor.from_pretrained("flax-community/clip-rsicd")

url = "https://raw.githubusercontent.com/arampacha/CLIP-rsicd/master/data/stadium_1.jpg"
image = Image.open(requests.get(url, stream=True).raw)

labels = ["residential area", "playground", "stadium", "forest", "airport"]
inputs = processor(text=[f"a photo of a {l}" for l in labels], images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image
probs = logits_per_image.softmax(dim=1)

for l, p in zip(labels, probs[0]):
    print(f"{l:<16} {p:.4f}")
```
Try it on colab
Features
- Designed to improve zero-shot image classification, text-to-image, and image-to-image retrieval on remote sensing images.
- Released several checkpoints for performance evaluation.
Installation
To reproduce the fine-tuning procedure, you can use the released script. The model was trained with a batch size of 1024, an Adafactor optimizer with linear warm-up and decay, and a peak learning rate of 1e-4 on a single TPU v3-8. The full log of the training run can be found on WandB.
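The released script is the authoritative reference; purely as an illustrative sketch (not the released Flax training code), the snippet below shows how an Adafactor optimizer with linear warm-up and decay and a peak learning rate of 1e-4 could be set up with the transformers library in PyTorch. The warm-up and total step counts are placeholders, not the values used for this model.

```python
# Hypothetical sketch of the optimizer/schedule described above.
# The actual fine-tuning used the released Flax script on a TPU v3-8.
from transformers import CLIPModel
from transformers.optimization import Adafactor, get_linear_schedule_with_warmup

# Base CLIP ViT-B/32 checkpoint (assumed starting point)
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

optimizer = Adafactor(
    model.parameters(),
    lr=1e-4,                 # peak learning rate
    scale_parameter=False,   # use the fixed external learning rate
    relative_step=False,
    warmup_init=False,
)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=1_000,    # placeholder warm-up length
    num_training_steps=10_000, # placeholder total number of steps
)
```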
Usage Examples
Basic Usage
```python
from PIL import Image
import requests
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("flax-community/clip-rsicd")
processor = CLIPProcessor.from_pretrained("flax-community/clip-rsicd")

url = "https://raw.githubusercontent.com/arampacha/CLIP-rsicd/master/data/stadium_1.jpg"
image = Image.open(requests.get(url, stream=True).raw)

labels = ["residential area", "playground", "stadium", "forest", "airport"]
inputs = processor(text=[f"a photo of a {l}" for l in labels], images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image
probs = logits_per_image.softmax(dim=1)

for l, p in zip(labels, probs[0]):
    print(f"{l:<16} {p:.4f}")
```
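For the retrieval use cases mentioned in the features above, a minimal sketch is shown below. It assumes a local collection of image files (`image_paths` is a placeholder) and uses the model's `get_image_features` and `get_text_features` methods to rank images against a text query; comparing image embeddings with one another in the same way gives image-to-image retrieval.

```python
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("flax-community/clip-rsicd")
processor = CLIPProcessor.from_pretrained("flax-community/clip-rsicd")

# Placeholder paths; replace with your own remote sensing images.
image_paths = ["img_0001.jpg", "img_0002.jpg", "img_0003.jpg"]
images = [Image.open(p) for p in image_paths]

with torch.no_grad():
    # Encode and L2-normalize the image collection.
    image_inputs = processor(images=images, return_tensors="pt")
    image_embeds = model.get_image_features(**image_inputs)
    image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)

    # Text-to-image retrieval: rank images by cosine similarity to a query.
    text_inputs = processor(text=["a photo of an airport"], return_tensors="pt", padding=True)
    text_embeds = model.get_text_features(**text_inputs)
    text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
    scores = (text_embeds @ image_embeds.T).squeeze(0)

for idx in scores.argsort(descending=True).tolist():
    print(image_paths[idx], f"{scores[idx]:.3f}")

# For image-to-image retrieval, compare the rows of image_embeds against a
# query image embedding in exactly the same way.
```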
Documentation
Technical Details
Model Date
July 2021
Model Type
The base model uses a ViT-B/32 Transformer architecture as an image encoder and a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss.
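As a minimal sketch of that contrastive objective (not the exact training code), the function below computes the symmetric cross-entropy loss over the batch similarity matrix, where matching (image, text) pairs lie on the diagonal:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds: torch.Tensor,
                          text_embeds: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of matching (image, text) pairs.

    image_embeds, text_embeds: (batch, dim) embeddings; row i of each is a pair.
    """
    # Normalize so the dot product is cosine similarity.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # (batch, batch) similarity matrix; correct pairs sit on the diagonal.
    logits = image_embeds @ text_embeds.T / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image-to-text and text-to-image).
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.T, targets)
    return (loss_i2t + loss_t2i) / 2
```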
Model Version
We release several checkpoints for the clip-rsicd model. Refer to our GitHub repo for zero-shot classification performance metrics for each of them.
Demo
Check out the model's text-to-image and image-to-image capabilities using this demo.
License
No license information provided in the original document.
Model Use
Intended Use
The model is intended as a research output for research communities. We hope that this model will enable researchers to better understand and explore zero-shot, arbitrary image classification.
In addition, we can imagine applications in defense and law enforcement, climate change and global warming, and even some consumer applications. A partial list of applications can be found here. In general, we think such models can be useful as digital assistants for humans engaged in searching through large collections of images.
We also hope it can be used for interdisciplinary studies of the potential impact of such models; the CLIP paper includes a discussion of potential downstream impacts to provide an example for this sort of analysis.
Primary intended uses
The primary intended users of these models are AI researchers.
We primarily imagine the model will be used by researchers to better understand the robustness, generalization, and other capabilities, biases, and constraints of computer vision models.
Data
The model was trained on publicly available remote sensing image captioning datasets, namely RSICD, UCM, and Sydney. More information on the datasets used can be found on our project page.
Performance and Limitations
Performance
| Property | Details |
|----------|---------|
| Model Type | The base model uses a ViT-B/32 Transformer architecture as an image encoder and a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. |
| Training Data | The model was trained on publicly available remote sensing image captioning datasets: RSICD, UCM, and Sydney. |
| Model name | k=1 | k=3 | k=5 | k=10 |
|------------|-----|-----|-----|------|
| original CLIP | 0.572 | 0.745 | 0.837 | 0.939 |
| clip-rsicd (this model) | 0.843 | 0.958 | 0.977 | 0.993 |
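These values appear to be top-k scores, where a query counts as a hit when the correct candidate appears among the k highest-scoring results; see the GitHub repo for the exact evaluation protocol. A minimal sketch of such a metric, assuming a precomputed score matrix and ground-truth indices (both placeholders), is:

```python
import torch

def top_k_accuracy(scores: torch.Tensor, targets: torch.Tensor, k: int) -> float:
    """Fraction of queries whose true item appears among the top-k scores.

    scores:  (num_queries, num_candidates) similarity/logit matrix
    targets: (num_queries,) index of the correct candidate for each query
    """
    top_k = scores.topk(k, dim=-1).indices             # (num_queries, k)
    hits = (top_k == targets.unsqueeze(-1)).any(dim=-1)
    return hits.float().mean().item()

# Under this reading, 0.843 at k=1 for this model would mean the correct
# candidate ranked first for roughly 84% of queries.
```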
Limitations
The model is fine-tuned on RSI data but can retain some of the biases and limitations of the original CLIP model. Refer to the original CLIP model card for details.