🚀 SigLIP 2 Base
SigLIP 2 extends the pretraining objective of SigLIP with prior, independently developed techniques into a unified recipe for improved semantic understanding, localization, and dense features.
🚀 Quick Start
SigLIP 2 is an advanced model that enhances the pretraining objective of SigLIP. It can be used for a variety of vision-language tasks.
✨ Features
- Versatile Applications: Suitable for tasks like zero-shot image classification and image-text retrieval, and can serve as a vision encoder for VLMs and other vision tasks.
- Enhanced Pretraining: Incorporates additional training objectives such as decoder loss, global-local and masked prediction loss, and aspect ratio and resolution adaptability.
📦 Installation
SigLIP 2 is used through the `transformers` library, which you can install with `pip install transformers`.
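If you want to confirm the install worked, a quick check is to import the library and print its version (SigLIP 2 support requires a reasonably recent `transformers` release):

```python
import transformers

# SigLIP 2 checkpoints need a recent transformers build, so verify the
# installed version before loading the model.
print(transformers.__version__)
```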
💻 Usage Examples
Basic Usage
```python
from transformers import pipeline

# Load a zero-shot image classification pipeline with the SigLIP 2 checkpoint.
ckpt = "google/siglip2-base-patch32-256"
image_classifier = pipeline(model=ckpt, task="zero-shot-image-classification")

# The pipeline accepts an image URL (or a PIL image) and a list of candidate labels.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
candidate_labels = ["2 cats", "a plane", "a remote"]

outputs = image_classifier(url, candidate_labels=candidate_labels)
print(outputs)
```
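The pipeline returns one dictionary per candidate label, each with a `score` and a `label` field, sorted from best to worst match, so the first entry is the label the model considers most likely for the image.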
Advanced Usage
```python
import torch
from transformers import AutoModel, AutoProcessor
from transformers.image_utils import load_image

# Load the model and processor.
ckpt = "google/siglip2-base-patch32-256"
model = AutoModel.from_pretrained(ckpt, device_map="auto").eval()
processor = AutoProcessor.from_pretrained(ckpt)

# Load an image and preprocess it.
image = load_image("https://huggingface.co/datasets/merve/coco/resolve/main/val2017/000000000285.jpg")
inputs = processor(images=[image], return_tensors="pt").to(model.device)

# Compute image embeddings (e.g. for retrieval or as vision features for a VLM).
with torch.no_grad():
    image_embeddings = model.get_image_features(**inputs)

print(image_embeddings.shape)
```
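Continuing from the snippet above (reusing `model`, `processor`, and `image`), the same checkpoint can also score image-text pairs directly, which is the basis for zero-shot classification and image-text retrieval. The sketch below uses made-up candidate captions and assumes the usual SigLIP-style settings (`padding="max_length"` with a 64-token maximum text length); SigLIP models apply a sigmoid rather than a softmax to the pairwise logits.

```python
# Candidate captions (illustrative only).
texts = ["a photo of 2 cats", "a photo of a plane", "a photo of a remote"]

# Tokenize the texts and preprocess the image together.
pair_inputs = processor(
    text=texts,
    images=[image],
    padding="max_length",
    max_length=64,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    outputs = model(**pair_inputs)

# logits_per_image has one row per image and one column per text;
# a sigmoid turns each pairwise logit into an independent match probability.
probs = torch.sigmoid(outputs.logits_per_image)
print(probs)
```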
For more code examples, refer to the siglip documentation.
📚 Documentation
Intended uses
You can use the raw model for tasks like zero-shot image classification and image-text retrieval, or as a vision encoder for VLMs (and other vision tasks).
Training procedure
SigLIP 2 adds some clever training objectives on top of SigLIP (a sketch of the base sigmoid objective they build on follows this list):
- Decoder loss
- Global-local and masked prediction loss
- Aspect ratio and resolution adaptability
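For context, these objectives extend SigLIP's pairwise sigmoid loss, which treats every image-text pair in a batch as an independent binary classification problem. The snippet below is a simplified, illustrative sketch (assuming already-projected embeddings and a learnable scale and bias); it omits the decoder, global-local, and masked prediction terms listed above.

```python
import torch
import torch.nn.functional as F

def sigmoid_contrastive_loss(image_embeds, text_embeds, logit_scale, logit_bias):
    # L2-normalize both sets of embeddings.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # Pairwise similarity logits between every image and every text.
    logits = image_embeds @ text_embeds.t() * logit_scale + logit_bias

    # +1 on the diagonal (matching pairs), -1 everywhere else.
    n = logits.size(0)
    labels = 2.0 * torch.eye(n, device=logits.device) - 1.0

    # Each pair is an independent binary problem: -log sigmoid(label * logit).
    return -F.logsigmoid(labels * logits).sum() / n
```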
Training data
SigLIP 2 is pre-trained on the WebLI dataset (Chen et al., 2023).
Compute
The model was trained on up to 2048 TPU-v5e chips.
Evaluation results
Evaluation of SigLIP 2 is reported in the paper (Tschannen et al., 2025), which includes the full benchmark tables and comparisons with SigLIP.

BibTeX entry and citation info
```bibtex
@misc{tschannen2025siglip2multilingualvisionlanguage,
  title={SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features},
  author={Michael Tschannen and Alexey Gritsenko and Xiao Wang and Muhammad Ferjad Naeem and Ibrahim Alabdulmohsin and Nikhil Parthasarathy and Talfan Evans and Lucas Beyer and Ye Xia and Basil Mustafa and Olivier Hénaff and Jeremiah Harmsen and Andreas Steiner and Xiaohua Zhai},
  year={2025},
  eprint={2502.14786},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2502.14786},
}
```
📄 License
This project is licensed under the Apache-2.0 license.