# 📖 Git-RSCLIP
Git-RSCLIP is a model pre-trained on the Git-10M dataset for remote sensing image-text tasks, providing zero-shot image classification and image-text retrieval.
## 🚀 Quick Start
Git-RSCLIP is pre-trained on the Git-10M dataset (a global-scale remote sensing image-text pair dataset consisting of 10 million image-text pairs) at resolution 256x256, and was first released in [this repository](https://github.com/Chen-Yang-Liu/Text2Earth). It uses a similar architecture to [google/siglip-large-patch16-256](https://huggingface.co/google/siglip-large-patch16-256).

This is the large version; the base version is available at [Git-RSCLIP-base](https://huggingface.co/lcybuaa/Git-RSCLIP-base).
## ✨ Features
You can use the raw model for tasks like zero-shot image classification and image-text retrieval.
## 💻 Usage Examples
### Basic Usage

Use Git-RSCLIP to get image features:
```python
from PIL import Image
import requests
import torch
from transformers import AutoProcessor, AutoModel

model = AutoModel.from_pretrained("lcybuaa/Git-RSCLIP")
processor = AutoProcessor.from_pretrained("lcybuaa/Git-RSCLIP")

# Load an example remote sensing image
url = "https://github.com/Chen-Yang-Liu/PromptCC/blob/main/Example/B/train_000051.png?raw=true"
image = Image.open(requests.get(url, stream=True).raw)

# Encode the image into a feature vector
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    image_features = model.get_image_features(**inputs)
```
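The card lists image-text retrieval as a supported task; the snippet below is a minimal sketch of how that could look, continuing from the example above. The candidate captions and the cosine-similarity ranking are illustrative, not prescribed by the original card:

```python
# Continuing from the snippet above: rank candidate captions for the image.
# The captions here are illustrative examples.
texts = ["a remote sensing image of river", "a remote sensing image of houses and roads"]
text_inputs = processor(text=texts, padding="max_length", return_tensors="pt")

with torch.no_grad():
    text_features = model.get_text_features(**text_inputs)

# L2-normalize both sides so the dot product equals cosine similarity
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)

similarity = image_features @ text_features.T  # shape: (num_images, num_texts)
best = similarity.argmax(dim=-1)
print(f"best matching text for image 0: '{texts[best[0].item()]}'")
```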
### Zero-shot image classification
```python
from PIL import Image
import requests
import torch
from transformers import AutoProcessor, AutoModel

model = AutoModel.from_pretrained("lcybuaa/Git-RSCLIP")
processor = AutoProcessor.from_pretrained("lcybuaa/Git-RSCLIP")

url = "https://github.com/Chen-Yang-Liu/PromptCC/blob/main/Example/B/train_000051.png?raw=true"
image = Image.open(requests.get(url, stream=True).raw)

# Candidate class descriptions
texts = ["a remote sensing image of river", "a remote sensing image of houses and roads"]
inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

logits_per_image = outputs.logits_per_image
probs = torch.sigmoid(logits_per_image)

# Rank candidate texts by probability and report the best match
top5_indices = torch.argsort(probs, descending=True)[:, :5].cpu().numpy()
top1_indices = top5_indices[:, 0]
print(f"image 0 is best described as: '{texts[top1_indices[0]]}'")
```
For more code examples, refer to the documentation.
## 🔧 Technical Details
### Training Data

Git-RSCLIP is pre-trained on the Git-10M dataset, a global-scale remote sensing image-text pair dataset consisting of 10 million image-text pairs [(Liu et al., 2024)](https://github.com/Chen-Yang-Liu/Text2Earth).
### Preprocessing
Images are resized/rescaled to the same resolution (256x256) and normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5).
Texts are tokenized and padded to the same length (64 tokens).
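For reference, here is a minimal sketch of the image preprocessing described above using torchvision. The interpolation mode and the local file name are assumptions; in practice, `AutoProcessor` applies the model's exact settings for both images and text:

```python
# Minimal sketch of the preprocessing described above (assumes bicubic
# resampling; prefer AutoProcessor for the model's exact settings).
from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((256, 256), interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.ToTensor(),                      # rescales pixel values to [0, 1]
    transforms.Normalize(mean=(0.5, 0.5, 0.5),  # maps [0, 1] to [-1, 1]
                         std=(0.5, 0.5, 0.5)),
])

image = Image.open("example.png").convert("RGB")  # hypothetical local file
pixel_values = preprocess(image).unsqueeze(0)     # shape: (1, 3, 256, 256)
```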
## 📚 Documentation
An evaluation of Git-RSCLIP compared with other CLIP models can be found in the paper.
## 📄 License

The model is licensed under the Apache-2.0 license.
## BibTeX entry and citation info

```bibtex
@ARTICLE{10988859,
  author={Liu, Chenyang and Chen, Keyan and Zhao, Rui and Zou, Zhengxia and Shi, Zhenwei},
  journal={IEEE Geoscience and Remote Sensing Magazine},
  title={Text2Earth: Unlocking text-driven remote sensing image generation with a global-scale dataset and a foundation model},
  year={2025},
  volume={},
  number={},
  pages={2-23},
  doi={10.1109/MGRS.2025.3560455}
}
```