Git-RSCLIP: an open-source vision-language model for multimodal understanding of remote sensing images

Git RSCLIP

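Developed by lcybuaa

Git-RSCLIP is a vision-language model pre-trained on the Git-10M dataset, focused on multimodal understanding of remote sensing images.

Text-to-image

Safetensors

License: Apache-2.0 #Remote sensing image-text retrieval #Zero-shot classification #256x256 resolution

Downloads: 59.37k

Release date: 3/3/2025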

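Model overview

Git-RSCLIP is a vision-language model built for tasks that relate remote sensing images to text. It supports zero-shot image classification and image-text retrieval.

Model features

Global-scale remote sensing dataset

Pre-trained on the Git-10M dataset, which contains 10 million remote sensing image-text pairs with global coverage.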

High-resolution processing

Supports image processing at 256x256 resolution, the size used during pre-training.

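Zero-shot learning

Can be applied directly to zero-shot image classification and image-text retrieval, with no fine-tuning required.

Model capabilities

Zero-shot image classification

Image-text retrieval

Remote sensing image understanding

Use cases

Remote sensing image analysis

River classification in remote sensing images

Distinguishes rivers from other geographic features in remote sensing images.

High-accuracy zero-shot classification

House and road detection

Detects man-made structures such as houses and roads in remote sensing images.

Supports multi-label classification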

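🚀 Git-RSCLIP

Git-RSCLIP is a model pre-trained at a 256x256 image size on the Git-10M dataset (a global-scale remote sensing image-text pair dataset containing 10 million image-text pairs). It was first released in this repository and uses an architecture similar to [google/siglip-large-patch16-256]. This is the large version; for the base version, see [[Git-RSCLIP-base](https://huggingface.co/lcybuaa/Git-RSCLIP-base)].

🚀 Quick start

You can use the raw model for tasks such as zero-shot image classification and image-text retrieval.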

💻 Usage examples

Basic usage

Getting image features with Git-RSCLIP

from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch

# Load the pre-trained model and its matching processor
model = AutoModel.from_pretrained("lcybuaa/Git-RSCLIP")
processor = AutoProcessor.from_pretrained("lcybuaa/Git-RSCLIP")

# Download an example remote sensing image
url = "https://github.com/Chen-Yang-Liu/PromptCC/blob/main/Example/B/train_000051.png?raw=true"
image = Image.open(requests.get(url, stream=True).raw)

# Convert the image into model-ready tensors
inputs = processor(images=image, return_tensors="pt")

# Inference only, so gradient tracking is disabled
with torch.no_grad():
    image_features = model.get_image_features(**inputs)
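
For image-text retrieval, features like the ones above are typically compared by cosine similarity and the candidates sorted by score. The sketch below illustrates only that ranking step, in plain Python, so it runs without downloading the model. The helper names and the toy 3-dimensional vectors are hypothetical. It assumes the text side has been embedded separately, for example with `get_text_features`, the counterpart method that SigLIP-style models in transformers expose.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def rank_by_similarity(query, candidates):
    """Return (index, cosine similarity) pairs, best match first."""
    q = l2_normalize(query)
    scores = []
    for idx, cand in enumerate(candidates):
        c = l2_normalize(cand)
        scores.append((idx, sum(a * b for a, b in zip(q, c))))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy stand-ins for real embeddings (hypothetical values, not model output).
image_feature = [0.9, 0.1, 0.0]
text_features = [
    [0.0, 1.0, 0.0],  # caption 0
    [1.0, 0.0, 0.0],  # caption 1
    [0.5, 0.5, 0.0],  # caption 2
]

ranking = rank_by_similarity(image_feature, text_features)
print([idx for idx, _ in ranking])  # [1, 2, 0]
```

The same function covers text-to-image retrieval by passing a text embedding as the query and image embeddings as the candidates.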

Zero-shot image classification

from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch

# Load the pre-trained model and its matching processor
model = AutoModel.from_pretrained("lcybuaa/Git-RSCLIP")
processor = AutoProcessor.from_pretrained("lcybuaa/Git-RSCLIP")

# Download an example remote sensing image
url = "https://github.com/Chen-Yang-Liu/PromptCC/blob/main/Example/B/train_000051.png?raw=true"
image = Image.open(requests.get(url, stream=True).raw)

# Candidate labels, each written as a short caption
texts = ["a remote sensing image of river", "a remote sensing image of houses and roads"]
inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

logits_per_image = outputs.logits_per_image
probs = torch.sigmoid(logits_per_image)  # one independent probability per text
# Rank the texts for each image, keeping at most the top 5
top5_indices = torch.argsort(probs, descending=True)[:, :5].cpu().numpy()
top1_indices = top5_indices[:, 0]
# Print the best-matching caption rather than its index
print(f"the image 0 is '{texts[top1_indices[0]]}'")
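
Because the probabilities above come from a sigmoid rather than a softmax, each caption is scored independently, so several labels can be accepted for the same image. This is one way to read the multi-label support mentioned in the use cases. The sketch below shows the idea in plain Python. The logits are made-up values, not model output, and the `select_labels` helper and the 0.5 threshold are assumptions chosen for illustration.

```python
import math


def sigmoid(x):
    """Logistic function mapping a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def select_labels(logits, labels, threshold=0.5):
    """Keep every label whose sigmoid probability reaches the threshold."""
    if len(logits) != len(labels):
        raise ValueError("logits and labels must have the same length")
    probs = [sigmoid(v) for v in logits]
    return [(label, round(p, 3)) for label, p in zip(labels, probs) if p >= threshold]


labels = ["river", "houses and roads", "forest"]
logits = [2.0, 0.5, -3.0]  # hypothetical per-label logits for one image

selected = select_labels(logits, labels)
print(selected)  # [('river', 0.881), ('houses and roads', 0.622)]
```

In practice the threshold would need tuning on labeled data, since 0.5 is only a starting point.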

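For more code examples, see the documentation.

🔧 Technical details

Training data

Git-RSCLIP is pre-trained on the Git-10M dataset, a global-scale remote sensing image-text pair dataset containing 10 million image-text pairs [(Liu et al., 2024)](https://github.com/chen-yang-liu/Text2Earth).

Preprocessing

Images: images are resized/rescaled to the same resolution (256x256) and normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5).
Text: texts are tokenized and padded to the same length (64 tokens).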
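
The arithmetic behind these two steps can be sketched in plain Python. With mean 0.5 and standard deviation 0.5, a channel value in [0, 1] is mapped to [-1, 1]. The sketch assumes 8-bit pixel values are first rescaled by 1/255, and that text longer than 64 tokens is truncated. The helper names and the padding id of 0 are hypothetical. In real use the processor handles all of this.

```python
def normalize_pixel(value, mean=0.5, std=0.5):
    """Map an 8-bit channel value (0-255) to the normalized range [-1, 1]."""
    scaled = value / 255.0  # assumed rescale step: 0-255 -> [0, 1]
    return (scaled - mean) / std


def pad_tokens(token_ids, max_length=64, pad_id=0):
    """Truncate to max_length, then right-pad with pad_id (hypothetical id)."""
    clipped = list(token_ids[:max_length])
    return clipped + [pad_id] * (max_length - len(clipped))


print(normalize_pixel(0), normalize_pixel(255))  # -1.0 1.0

padded = pad_tokens([11, 42, 7])  # made-up token ids
print(len(padded))  # 64
```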

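📚 Detailed documentation

Evaluation results

Evaluation results comparing Git-RSCLIP with other CLIP models, taken from the paper:

[Figure: evaluation results]

BibTeX entry and citation info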

@ARTICLE{10988859,
  author={Liu, Chenyang and Chen, Keyan and Zhao, Rui and Zou, Zhengxia and Shi, Zhenwei},
  journal={IEEE Geoscience and Remote Sensing Magazine}, 
  title={Text2Earth: Unlocking text-driven remote sensing image generation with a global-scale dataset and a foundation model}, 
  year={2025},
  volume={},
  number={},
  pages={2-23},
  doi={10.1109/MGRS.2025.3560455}}

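📄 License

This project is released under the Apache-2.0 license.

📋 Other information

Property	Details
Model type	Text-to-image model for the vision, multimodal, vision-language, and remote sensing domains
Training data	Git-10M dataset (a global-scale remote sensing image-text pair dataset containing 10 million image-text pairs)
Base model	google/siglip-large-patch16-256
Task tag	Text-to-image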