🚀 RS-M-CLIP: Multilingual Vision-Language Pre-training for the Remote Sensing Domain
This is the official repository for the paper “Multilingual Vision-Language Pre-training for the Remote Sensing Domain”. It presents RS-M-CLIP, a novel vision-and-language model that achieves state-of-the-art results on a variety of vision-and-language tasks in the remote sensing domain.
✨ Features
- Multilingual Support: Supports multiple languages including English, Portuguese, Spanish, French, German, Dutch, Italian, Chinese, Korean, and Russian.
- State-of-the-Art Performance: Achieves excellent results in tasks such as cross-modal and multilingual image-text retrieval, and zero-shot image classification.
- Innovative Training Approach: Explores fine-tuning of a multilingual CLIP model and uses a self-supervised method based on aligning local and global representations from individual input images, along with the standard CLIP objective.
📦 Installation
The model can be loaded using the OpenCLIP library, which will load the weights stored in the Hugging Face Hub.
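If OpenCLIP is not already available in your environment, it is typically installed from PyPI under the package name open_clip_torch (PyTorch is pulled in as a dependency):

pip install open_clip_torch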
💻 Usage Examples
Basic Usage
To load the model:
import torch
import open_clip
model, preprocess, preprocess_val = open_clip.create_model_and_transforms('hf-hub:joaodaniel/RS-M-CLIP')
tokenizer = open_clip.get_tokenizer('hf-hub:joaodaniel/RS-M-CLIP')
Advanced Usage
Image Classification (in English)
model = model.eval()

from PIL import Image
image = preprocess(Image.open('figs/airplane_004.jpg')).unsqueeze(0)

text_queries = [
    "A residential area with houses.",
    "Blocks of buildings can be seen in the factory.",
    "Dense residential areas on both sides of the road.",
    "Many airplanes in the open area.",
    "A cute cat",
]
text = tokenizer(text_queries)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1).cpu().numpy()[0]

for query, prob in zip(text_queries, text_probs):
    print(f"{query:<40} {prob * 100:5.1f}%")
Output:
A residential area with houses. 0.0%
Blocks of buildings can be seen in the factory. 0.0%
Dense residential areas on both sides of the road. 0.0%
Many airplanes in the open area. 100.0%
A cute cat 0.0%

Image Classification (in Spanish)
model = model.eval()

from PIL import Image
image = preprocess(Image.open('figs/golf_course_004.jpg')).unsqueeze(0)

text_queries = [
    "Una zona residencial con casas.",
    "Se pueden ver bloques de edificios en la fábrica.",
    "Zonas residenciales densas a ambos lados de la carretera.",
    "Muchos aviones en el área abierta.",
    "Un lindo gato",
    "Un campo de golf con bunkers."
]
text = tokenizer(text_queries)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1).cpu().numpy()[0]

for query, prob in zip(text_queries, text_probs):
    print(f"{query:<60} {prob * 100:5.1f}%")
Output:
Una zona residencial con casas. 0.0%
Se pueden ver bloques de edificios en la fábrica. 0.0%
Zonas residenciales densas a ambos lados de la carretera. 0.0%
Muchos aviones en el área abierta. 0.0%
Un lindo gato 0.0%
Un campo de golf con bunkers. 100.0%
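
Cross-Modal Image-Text Retrieval

The same encoders can also be used for cross-modal image-text retrieval, one of the tasks covered by the paper. Below is a minimal sketch that ranks a hypothetical local pool of images in figs/ against a single text query by cosine similarity; the file paths and query string are placeholders, not part of the official examples.

import glob

import torch
from PIL import Image

# Encode a small (hypothetical) pool of candidate images.
image_paths = sorted(glob.glob('figs/*.jpg'))
images = torch.stack([preprocess(Image.open(p)) for p in image_paths])

# A single text query (any of the supported languages can be used).
query = tokenizer(["Many airplanes in the open area."])

with torch.no_grad():
    image_features = model.encode_image(images)
    text_features = model.encode_text(query)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # Cosine similarity between the query and every image.
    scores = (text_features @ image_features.T).squeeze(0)

# Print the candidate images from most to least similar.
for idx in scores.argsort(descending=True):
    print(f"{image_paths[idx]:<40} {scores[idx].item():.3f}")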

📚 Documentation
Abstract
Methods based on Contrastive Language-Image Pre-training (CLIP) are widely used in vision-and-language tasks involving remote sensing data. However, the use of different pre-training mechanisms and multilingual inputs has received less attention. This work proposes a novel vision-and-language model, RS-M-CLIP, which explores fine-tuning of a multilingual CLIP model and a self-supervised method. Model training uses pre-existing datasets of remote sensing images paired with English captions, followed by machine translation into nine additional languages. The results show that translated data is helpful, and the model achieves state-of-the-art results in various tasks.
Description
RS-M-CLIP (Remote Sensing Multilingual CLIP) is a CLIP-based model for the remote sensing domain. It improves CLIP's performance without increasing the training data volume by aggregating available image-caption datasets, using a self-distillation method with the contrastive learning objective, and using translated captions. The model starts training from a CLIP model with a multilingual text encoder and a ViT-B vision encoder: https://huggingface.co/laion/CLIP-ViT-B-32-xlm-roberta-base-laion5B-s13B-b90k. It can process multiple languages and achieves state-of-the-art results in cross-modal image-text retrieval.
🔧 Technical Details
Model training relied on assembling pre-existing datasets of remote sensing images paired with English captions, followed by the use of automated machine translation into nine additional languages. The model explores the fine-tuning of a multilingual CLIP model and tests the use of a self-supervised method based on aligning local and global representations from individual input images, together with the standard CLIP objective.
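For reference, below is a minimal sketch of the standard CLIP contrastive objective that the self-supervised alignment term is combined with: a symmetric cross-entropy over the image-text similarity matrix of a batch. Function and variable names are illustrative and not taken from the actual training code.

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, logit_scale):
    # L2-normalise both modalities so the dot product is a cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarities for a batch of N matched image-caption pairs.
    logits_per_image = logit_scale * image_features @ text_features.T
    logits_per_text = logits_per_image.T

    # The i-th image matches the i-th caption: targets are the diagonal.
    targets = torch.arange(image_features.shape[0], device=image_features.device)
    loss_images = F.cross_entropy(logits_per_image, targets)
    loss_texts = F.cross_entropy(logits_per_text, targets)
    return (loss_images + loss_texts) / 2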
📄 License
This project is licensed under the MIT license.
Citation
If you find our work useful 🙏, please cite us as:
@article{silva2024large,
  title={Multilingual Vision-Language Pre-training for the Remote Sensing Domain},
  author={Silva, Jo{\~a}o Daniel and Magalh{\~a}es, Jo{\~a}o and Tuia, Devis and Martins, Bruno},
  journal={arXiv:2410.23370},
  year={2024}
}