SigLIP 2 Open-Source Vision-Language Encoder - Enhancing Multilingual Semantic Understanding and Feature Extraction Capabilities

Siglip2 Base Patch16 256

Developed by Google

SigLIP 2 is a multilingual vision-language encoder with improved semantic understanding, localization, and dense feature extraction capabilities.

Image-to-Text

Transformers

Open Source License: Apache-2.0

Tags: Zero-shot Image Classification, Image-Text Retrieval, Multimodal Encoder

Downloads 45.24k

Release Time: 2/17/2025

Model Overview

Building upon SigLIP, SigLIP 2 integrates multiple technologies to enhance performance on vision-language tasks, applicable to zero-shot image classification and image-text retrieval.

Model Features

Enhanced Semantic Understanding

Improved semantic comprehension through techniques like decoder loss integration.

Enhanced Localization Capability

Utilizes global-local and masked prediction losses to improve localization accuracy.

Dense Feature Extraction

Optimized dense feature extraction suitable for various vision tasks.

Aspect Ratio and Resolution Adaptability

Supports multiple aspect ratios and resolutions, enhancing model flexibility.

Model Capabilities

Zero-shot Image Classification

Image-Text Retrieval

Visual Feature Extraction

Use Cases

Image Classification

Zero-shot Image Classification

Classifies images without fine-tuning, supporting custom labels.

Demonstrates excellent performance across multiple datasets.

Image-Text Retrieval

Cross-modal Retrieval

Retrieves relevant images based on text or relevant text based on images.

Pre-trained on the WebLI dataset, which underpins its retrieval capabilities.
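As a rough illustration of cross-modal retrieval, the sketch below ranks a small image index against a text query by cosine similarity. The 3-dimensional vectors and file names are made-up stand-ins for real embeddings, which would come from the model's image and text encoders; only the ranking logic is the point here.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, candidates):
    """Return (name, score) pairs sorted from most to least similar to the query."""
    scored = [(name, cosine_similarity(query, emb)) for name, emb in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy embeddings (real ones are much higher-dimensional).
image_index = {
    "cats.jpg": [0.9, 0.1, 0.0],
    "plane.jpg": [0.0, 0.2, 0.9],
    "remote.jpg": [0.4, 0.8, 0.1],
}
text_query = [0.8, 0.3, 0.1]  # stand-in for the embedding of a caption about cats

ranking = rank_by_similarity(text_query, image_index)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Image-to-text retrieval works the same way with the roles swapped: the image embedding is the query and the candidates are text embeddings.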

🚀 SigLIP 2 Base

SigLIP 2 extends the pretraining objective of SigLIP with prior, independently developed techniques into a unified recipe, aiming to enhance semantic understanding, localization, and dense features.

🚀 Quick Start

You can use the raw model for tasks like zero-shot image classification and image-text retrieval, or as a vision encoder for VLMs (and other vision tasks).

✨ Features

Extends SigLIP's pretraining objective.
Improves semantic understanding, localization, and dense features.

💻 Usage Examples

Basic Usage

Here is how to use this model to perform zero-shot image classification:

from transformers import pipeline

# load pipeline
ckpt = "google/siglip2-base-patch16-256"
image_classifier = pipeline(model=ckpt, task="zero-shot-image-classification")

# image URL and candidate labels (the pipeline accepts a URL, a local path, or a PIL image)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
candidate_labels = ["2 cats", "a plane", "a remote"]

# run inference
outputs = image_classifier(url, candidate_labels=candidate_labels)
print(outputs)
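A note on reading the scores: SigLIP-family models are trained with a sigmoid objective, and as far as we understand the Transformers pipeline scores each label independently with a sigmoid for these models, so the scores need not sum to 1. The sketch below contrasts the two normalizations on made-up logits; the logit values are purely illustrative and were not produced by the model.

```python
import math


def sigmoid(x):
    """Independent per-label probability."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    """Probabilities that compete across labels and sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["2 cats", "a plane", "a remote"]
logits = [2.0, -6.0, 1.0]  # hypothetical image-text logits

sigmoid_scores = [sigmoid(z) for z in logits]
softmax_scores = softmax(logits)

for label, s, p in zip(labels, sigmoid_scores, softmax_scores):
    print(f"{label}: sigmoid={s:.3f} softmax={p:.3f}")
```

With sigmoid scoring, several labels can score high at once (here both "2 cats" and "a remote"), which suits images that match more than one description.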

Advanced Usage

You can encode an image using the Vision Tower like so:

import torch
from transformers import AutoModel, AutoProcessor
from transformers.image_utils import load_image

# load the model and processor
ckpt = "google/siglip2-base-patch16-256"
model = AutoModel.from_pretrained(ckpt, device_map="auto").eval()
processor = AutoProcessor.from_pretrained(ckpt)

# load the image
image = load_image("https://huggingface.co/datasets/merve/coco/resolve/main/val2017/000000000285.jpg")
inputs = processor(images=[image], return_tensors="pt").to(model.device)

# run inference
with torch.no_grad():
    image_embeddings = model.get_image_features(**inputs)

print(image_embeddings.shape)
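To score an image against text rather than only embed it, a minimal sketch along the lines of the Transformers SigLIP 2 documentation is shown below. The helper names `build_prompts` and `score_labels` and the prompt template are our own illustrative choices; the `padding="max_length", max_length=64` arguments follow our recollection of the documentation's recommendation and should be checked against it.

```python
def build_prompts(labels, template="This is a photo of {}."):
    """Wrap bare labels in a caption-like prompt (the template is an assumption)."""
    return [template.format(label) for label in labels]


def score_labels(image_url, labels, ckpt="google/siglip2-base-patch16-256"):
    """Return a {label: probability} dict for one image (hypothetical helper)."""
    # Heavy imports live inside the function so the helpers can be
    # defined without loading the model.
    import torch
    from transformers import AutoModel, AutoProcessor
    from transformers.image_utils import load_image

    model = AutoModel.from_pretrained(ckpt).eval()
    processor = AutoProcessor.from_pretrained(ckpt)

    image = load_image(image_url)
    inputs = processor(
        text=build_prompts(labels),
        images=[image],
        padding="max_length",  # assumed to match how the model was trained
        max_length=64,
        return_tensors="pt",
    )

    with torch.no_grad():
        outputs = model(**inputs)

    # One row per image, one column per text; the sigmoid gives independent scores.
    probs = torch.sigmoid(outputs.logits_per_image)[0]
    return dict(zip(labels, probs.tolist()))


if __name__ == "__main__":
    url = "https://huggingface.co/datasets/merve/coco/resolve/main/val2017/000000000285.jpg"
    print(score_labels(url, ["a bear", "a plane", "a remote"]))
```

Loading the model inside the function keeps the sketch short; for repeated calls, load the model and processor once and reuse them.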

For more code examples, see the SigLIP documentation.

📚 Documentation

Intended uses

You can use the raw model for tasks like zero-shot image classification and image-text retrieval, or as a vision encoder for VLMs (and other vision tasks).

Training procedure

SigLIP 2 adds several training objectives on top of SigLIP:

Decoder loss
Global-local and masked prediction loss
Aspect ratio and resolution adaptability
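For context, the base objective these additions build on is the pairwise sigmoid loss from the original SigLIP paper, which treats every image-text pair in a batch as an independent binary classification. The sketch below is our own plain-Python rendering of that base loss, not of the SigLIP 2 additions listed above; it assumes L2-normalized embeddings, and the temperature and bias defaults reflect the initial values we recall from the SigLIP paper.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pairwise_sigmoid_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """Mean over images of the summed binary log-losses across all image-text pairs.

    Pair (i, j) is a positive when i == j and a negative otherwise.
    """
    n = len(image_embs)
    total = 0.0
    for i, img in enumerate(image_embs):
        for j, txt in enumerate(text_embs):
            logit = temperature * sum(a * b for a, b in zip(img, txt)) + bias
            label = 1.0 if i == j else -1.0
            total += log_sigmoid(label * logit)
    return -total / n


# Toy unit-norm embeddings (hypothetical values).
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = pairwise_sigmoid_loss(images, aligned_texts)
swapped_loss = pairwise_sigmoid_loss(images, swapped_texts)
print(f"aligned: {aligned_loss:.3f}, swapped: {swapped_loss:.3f}")
```

Matching pairs yield a much lower loss than mismatched ones, which is the signal that pulls paired image and text embeddings together during training.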

Training data

SigLIP 2 is pre-trained on the WebLI dataset (Chen et al., 2023).

Compute

The model was trained on up to 2048 TPU-v5e chips.

Evaluation results

Evaluation of SigLIP 2 is shown below (taken from the paper).

[Figure: Evaluation Table]

BibTeX entry and citation info

@misc{tschannen2025siglip2multilingualvisionlanguage,
      title={SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features}, 
      author={Michael Tschannen and Alexey Gritsenko and Xiao Wang and Muhammad Ferjad Naeem and Ibrahim Alabdulmohsin and Nikhil Parthasarathy and Talfan Evans and Lucas Beyer and Ye Xia and Basil Mustafa and Olivier Hénaff and Jeremiah Harmsen and Andreas Steiner and Xiaohua Zhai},
      year={2025},
      eprint={2502.14786},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2502.14786}, 
}

📄 License

This project is licensed under the Apache-2.0 license.
