🚀 RS-M-CLIP: Multilingual Vision-Language Pre-training for the Remote Sensing Domain
This is the official repository for the paper “Multilingual Vision-Language Pre-training for the Remote Sensing Domain”. It presents RS-M-CLIP, a novel vision-and-language model that achieves state-of-the-art results on a variety of vision-and-language tasks in the remote sensing domain.
✨ Features
- Multilingual Support: Supports multiple languages including English, Portuguese, Spanish, French, German, Dutch, Italian, Chinese, Korean, and Russian.
- State-of-the-Art Performance: Achieves excellent results in tasks such as cross-modal and multilingual image-text retrieval, and zero-shot image classification.
- Innovative Training Approach: Explores fine-tuning of a multilingual CLIP model and uses a self-supervised method based on aligning local and global representations from individual input images, along with the standard CLIP objective.
📦 Installation
The model can be loaded using the OpenCLIP library, which will load the weights stored in the Hugging Face Hub.
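If OpenCLIP is not already available in your environment, it is typically installed from PyPI under the package name open_clip_torch (PyTorch is pulled in as a dependency):

pip install open_clip_torch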
💻 Usage Examples
Basic Usage
To load the model:
import torch
import open_clip
model, preprocess, preprocess_val = open_clip.create_model_and_transforms('hf-hub:joaodaniel/RS-M-CLIP')
tokenizer = open_clip.get_tokenizer('hf-hub:joaodaniel/RS-M-CLIP')
Advanced Usage
Image Classification (in English)
model = model.eval()

from PIL import Image
image = preprocess(Image.open('figs/airplane_004.jpg')).unsqueeze(0)

text_queries = [
    "A residential area with houses.",
    "Blocks of buildings can be seen in the factory.",
    "Dense residential areas on both sides of the road.",
    "Many airplanes in the open area.",
    "A cute cat",
]
text = tokenizer(text_queries)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1).cpu().numpy()[0]

for query, prob in zip(text_queries, text_probs):
    print(f"{query:<40} {prob * 100:5.1f}%")
Output:
A residential area with houses. 0.0%
Blocks of buildings can be seen in the factory. 0.0%
Dense residential areas on both sides of the road. 0.0%
Many airplanes in the open area. 100.0%
A cute cat 0.0%

Image Classification (in Spanish)
model = model.eval()

from PIL import Image
image = preprocess(Image.open('figs/golf_course_004.jpg')).unsqueeze(0)

text_queries = [
    "Una zona residencial con casas.",
    "Se pueden ver bloques de edificios en la fábrica.",
    "Zonas residenciales densas a ambos lados de la carretera.",
    "Muchos aviones en el área abierta.",
    "Un lindo gato",
    "Un campo de golf con bunkers."
]
text = tokenizer(text_queries)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1).cpu().numpy()[0]

for query, prob in zip(text_queries, text_probs):
    print(f"{query:<60} {prob * 100:5.1f}%")
Output:
Una zona residencial con casas. 0.0%
Se pueden ver bloques de edificios en la fábrica. 0.0%
Zonas residenciales densas a ambos lados de la carretera. 0.0%
Muchos aviones en el área abierta. 0.0%
Un lindo gato 0.0%
Un campo de golf con bunkers. 100.0%
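
Cross-Modal Image-Text Retrieval

The same encoders can also be used for cross-modal image-text retrieval, one of the tasks covered by the paper. Below is a minimal sketch that ranks a hypothetical local pool of images in figs/ against a single text query by cosine similarity; the file paths and query string are placeholders, not part of the official examples.

import glob

import torch
from PIL import Image

# Encode a small (hypothetical) pool of candidate images.
image_paths = sorted(glob.glob('figs/*.jpg'))
images = torch.stack([preprocess(Image.open(p)) for p in image_paths])

# A single text query (any of the supported languages can be used).
query = tokenizer(["Many airplanes in the open area."])

with torch.no_grad():
    image_features = model.encode_image(images)
    text_features = model.encode_text(query)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # Cosine similarity between the query and every image.
    scores = (text_features @ image_features.T).squeeze(0)

# Print the candidate images from most to least similar.
for idx in scores.argsort(descending=True):
    print(f"{image_paths[idx]:<40} {scores[idx].item():.3f}")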

📚 Documentation
Abstract
Methods based on Contrastive Language-Image Pre-training (CLIP) are widely used in vision-and-language tasks involving remote sensing data. However, the use of different pre-training mechanisms and multilingual inputs has received less attention. This work proposes a novel vision-and-language model, RS-M-CLIP, which explores fine-tuning of a multilingual CLIP model and a self-supervised method. Model training uses pre-existing datasets of remote sensing images paired with English captions, followed by machine translation into nine additional languages. The results show that translated data is helpful, and the model achieves state-of-the-art results in various tasks.
Description
RS-M-CLIP (Remote Sensing Multilingual CLIP) is a CLIP-based model for the remote sensing domain. It improves CLIP's performance without increasing the training data volume by aggregating available image-caption datasets, using a self-distillation method with the contrastive learning objective, and using translated captions. The model starts training from a CLIP model with a multilingual text encoder and a ViT-B vision encoder: https://huggingface.co/laion/CLIP-ViT-B-32-xlm-roberta-base-laion5B-s13B-b90k. It can process multiple languages and achieves state-of-the-art results in cross-modal image-text retrieval.
🔧 Technical Details
Model training relied on assembling pre-existing datasets of remote sensing images paired with English captions, followed by the use of automated machine translation into nine additional languages. The model explores the fine-tuning of a multilingual CLIP model and tests the use of a self-supervised method based on aligning local and global representations from individual input images, together with the standard CLIP objective.
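For reference, below is a minimal sketch of the standard CLIP contrastive objective that the self-supervised alignment term is combined with: a symmetric cross-entropy over the image-text similarity matrix of a batch. Function and variable names are illustrative and not taken from the actual training code.

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, logit_scale):
    # L2-normalise both modalities so the dot product is a cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarities for a batch of N matched image-caption pairs.
    logits_per_image = logit_scale * image_features @ text_features.T
    logits_per_text = logits_per_image.T

    # The i-th image matches the i-th caption: targets are the diagonal.
    targets = torch.arange(image_features.shape[0], device=image_features.device)
    loss_images = F.cross_entropy(logits_per_image, targets)
    loss_texts = F.cross_entropy(logits_per_text, targets)
    return (loss_images + loss_texts) / 2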
📄 License
This project is licensed under the MIT license.
Citation
If you find our work useful 🙏, please cite us as:
@article{silva2024large,
  title={Multilingual Vision-Language Pre-training for the Remote Sensing Domain},
  author={Silva, Jo{\~a}o Daniel and Magalh{\~a}es, Jo{\~a}o and Tuia, Devis and Martins, Bruno},
  journal={arXiv:2410.23370},
  year={2024}
}