Model Card: clip-rsicd
This model is a fine-tuned version of CLIP by OpenAI. It aims to enhance zero-shot image classification, text-to-image, and image-to-image retrieval, specifically for remote sensing images.
Quick Start
You can start using the clip-rsicd model with the following code example:
```python
from PIL import Image
import requests
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("flax-community/clip-rsicd")
processor = CLIPProcessor.from_pretrained("flax-community/clip-rsicd")

url = "https://raw.githubusercontent.com/arampacha/CLIP-rsicd/master/data/stadium_1.jpg"
image = Image.open(requests.get(url, stream=True).raw)

labels = ["residential area", "playground", "stadium", "forest", "airport"]
inputs = processor(text=[f"a photo of a {l}" for l in labels], images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image
probs = logits_per_image.softmax(dim=1)

for l, p in zip(labels, probs[0]):
    print(f"{l:<16} {p:.4f}")
```
Try it on colab
Features
- Designed to improve zero-shot image classification, text-to-image, and image-to-image retrieval on remote sensing images.
- Released several checkpoints for performance evaluation.
Installation
To reproduce the fine-tuning procedure, you can use the released script. The model was trained with a batch size of 1024, an Adafactor optimizer with linear warm-up and decay, and a peak learning rate of 1e-4 on a single TPU v3-8. The full log of the training run can be found on WandB.
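The released script is the authoritative reference; purely as an illustrative sketch (not the released Flax training code), the snippet below shows how an Adafactor optimizer with linear warm-up and decay and a peak learning rate of 1e-4 could be set up with the transformers library in PyTorch. The warm-up and total step counts are placeholders, not the values used for this model.

```python
# Hypothetical sketch of the optimizer/schedule described above.
# The actual fine-tuning used the released Flax script on a TPU v3-8.
from transformers import CLIPModel
from transformers.optimization import Adafactor, get_linear_schedule_with_warmup

# Base CLIP ViT-B/32 checkpoint (assumed starting point)
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

optimizer = Adafactor(
    model.parameters(),
    lr=1e-4,                 # peak learning rate
    scale_parameter=False,   # use the fixed external learning rate
    relative_step=False,
    warmup_init=False,
)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=1_000,    # placeholder warm-up length
    num_training_steps=10_000, # placeholder total number of steps
)
```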
Usage Examples
Basic Usage
```python
from PIL import Image
import requests
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("flax-community/clip-rsicd")
processor = CLIPProcessor.from_pretrained("flax-community/clip-rsicd")

url = "https://raw.githubusercontent.com/arampacha/CLIP-rsicd/master/data/stadium_1.jpg"
image = Image.open(requests.get(url, stream=True).raw)

labels = ["residential area", "playground", "stadium", "forest", "airport"]
inputs = processor(text=[f"a photo of a {l}" for l in labels], images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image
probs = logits_per_image.softmax(dim=1)

for l, p in zip(labels, probs[0]):
    print(f"{l:<16} {p:.4f}")
```
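For the retrieval use cases mentioned in the features above, a minimal sketch is shown below. It assumes a local collection of image files (`image_paths` is a placeholder) and uses the model's `get_image_features` and `get_text_features` methods to rank images against a text query; comparing image embeddings with one another in the same way gives image-to-image retrieval.

```python
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("flax-community/clip-rsicd")
processor = CLIPProcessor.from_pretrained("flax-community/clip-rsicd")

# Placeholder paths; replace with your own remote sensing images.
image_paths = ["img_0001.jpg", "img_0002.jpg", "img_0003.jpg"]
images = [Image.open(p) for p in image_paths]

with torch.no_grad():
    # Encode and L2-normalize the image collection.
    image_inputs = processor(images=images, return_tensors="pt")
    image_embeds = model.get_image_features(**image_inputs)
    image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)

    # Text-to-image retrieval: rank images by cosine similarity to a query.
    text_inputs = processor(text=["a photo of an airport"], return_tensors="pt", padding=True)
    text_embeds = model.get_text_features(**text_inputs)
    text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
    scores = (text_embeds @ image_embeds.T).squeeze(0)

for idx in scores.argsort(descending=True).tolist():
    print(image_paths[idx], f"{scores[idx]:.3f}")

# For image-to-image retrieval, compare the rows of image_embeds against a
# query image embedding in exactly the same way.
```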
Documentation
Technical Details
Model Date
July 2021
Model Type
The base model uses a ViT-B/32 Transformer architecture as an image encoder and a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss.
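As a minimal sketch of that contrastive objective (not the exact training code), the function below computes the symmetric cross-entropy loss over the batch similarity matrix, where matching (image, text) pairs lie on the diagonal:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds: torch.Tensor,
                          text_embeds: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of matching (image, text) pairs.

    image_embeds, text_embeds: (batch, dim) embeddings; row i of each is a pair.
    """
    # Normalize so the dot product is cosine similarity.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # (batch, batch) similarity matrix; correct pairs sit on the diagonal.
    logits = image_embeds @ text_embeds.T / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image-to-text and text-to-image).
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.T, targets)
    return (loss_i2t + loss_t2i) / 2
```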
Model Version
We release several checkpoints for the clip-rsicd model. Refer to our GitHub repo for zero-shot classification performance metrics for each of them.
Demo
Check out the model's text-to-image and image-to-image capabilities using this demo.
License
No license information provided in the original document.
Model Use
Intended Use
The model is intended as a research output for research communities. We hope that this model will enable researchers to better understand and explore zero-shot, arbitrary image classification.
In addition, we can imagine applications in defense and law enforcement, climate change and global warming, and even some consumer applications. A partial list of applications can be found here. In general, we think such models can be useful as digital assistants for humans engaged in searching through large collections of images.
We also hope it can be used for interdisciplinary studies of the potential impact of such models; the CLIP paper includes a discussion of potential downstream impacts to provide an example for this sort of analysis.
Primary intended uses
The primary intended users of these models are AI researchers.
We primarily imagine the model will be used by researchers to better understand the robustness, generalization, and other capabilities, biases, and constraints of computer vision models.
Data
The model was trained on publicly available remote sensing image captioning datasets, namely RSICD, UCM, and Sydney. More information on the datasets used can be found on our project page.
Performance and Limitations
Performance
| Property | Details |
|----------|---------|
| Model Type | The base model uses a ViT-B/32 Transformer architecture as an image encoder and a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. |
| Training Data | The model was trained on publicly available remote sensing image captioning datasets: RSICD, UCM, and Sydney. |
| Model name | k=1 | k=3 | k=5 | k=10 |
|------------|-----|-----|-----|------|
| original CLIP | 0.572 | 0.745 | 0.837 | 0.939 |
| clip-rsicd (this model) | 0.843 | 0.958 | 0.977 | 0.993 |
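These values appear to be top-k scores, where a query counts as a hit when the correct candidate appears among the k highest-scoring results; see the GitHub repo for the exact evaluation protocol. A minimal sketch of such a metric, assuming a precomputed score matrix and ground-truth indices (both placeholders), is:

```python
import torch

def top_k_accuracy(scores: torch.Tensor, targets: torch.Tensor, k: int) -> float:
    """Fraction of queries whose true item appears among the top-k scores.

    scores:  (num_queries, num_candidates) similarity/logit matrix
    targets: (num_queries,) index of the correct candidate for each query
    """
    top_k = scores.topk(k, dim=-1).indices             # (num_queries, k)
    hits = (top_k == targets.unsqueeze(-1)).any(dim=-1)
    return hits.float().mean().item()

# Under this reading, 0.843 at k=1 for this model would mean the correct
# candidate ranked first for roughly 84% of queries.
```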
Limitations
The model is fine-tuned on RSI data but can retain some of the biases and limitations of the original CLIP model. Refer to the original CLIP model card for details.