SigLIP 2 Open-Source Vision-Language Model - Free Deployment to Enhance Semantic Understanding and Feature Extraction

SigLIP 2 Base Patch16 512

Developed by Google

SigLIP 2 is a vision-language model that integrates multiple technologies to enhance semantic understanding, localization, and dense feature extraction capabilities.

Text-to-Image

Transformers

Open Source License: Apache-2.0 #Zero-shot Image Classification #Image-Text Retrieval #Multimodal Encoder

Downloads 28.01k

Release Time: 2/17/2025

Model Overview

SigLIP 2 builds on SigLIP's pretraining objective and improves performance on vision-language tasks through a unified training scheme. It is suitable for zero-shot image classification, image-text retrieval, and similar tasks.

Model Features

Unified Training Scheme

Integrates multiple independently developed techniques into a unified training scheme, enhancing semantic understanding, localization, and dense feature extraction capabilities.

Multi-task Support

Supports tasks such as zero-shot image classification and image-text retrieval, and can serve as a visual encoder for vision-language models.

Innovative Training Objectives

Adds training objectives on top of SigLIP: a decoder loss, a global-local and masked prediction loss, and aspect ratio and resolution adaptability.

Model Capabilities

Zero-shot Image Classification

Image-Text Retrieval

Visual Encoding

Use Cases

Image Classification

Zero-shot Image Classification

Classifies images against a set of candidate labels without training the model on those specific categories.

Image-Text Retrieval

Image-Text Matching

Matches images with text to retrieve relevant images or text.

🚀 SigLIP 2 Base

SigLIP 2 extends the pretraining objective of SigLIP with prior, independently developed techniques into a unified recipe, enhancing semantic understanding, localization, and dense features.

🚀 Quick Start

SigLIP 2 can be used for a range of vision-related tasks: zero-shot image classification, image-text retrieval, or as a vision encoder for VLMs and other vision tasks.

✨ Features

Extended Pretraining: SigLIP 2 extends the pretraining objective of SigLIP with independently developed techniques.
Improved Performance: It offers better semantic understanding, localization, and dense features.
Versatile Applications: Suitable for zero-shot image classification, image-text retrieval, and as a vision encoder.

💻 Usage Examples

Basic Usage

from transformers import pipeline

# load pipeline
ckpt = "google/siglip2-base-patch16-512"
image_classifier = pipeline(model=ckpt, task="zero-shot-image-classification")

# load image and candidate labels
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
candidate_labels = ["2 cats", "a plane", "a remote"]

# run inference (the pipeline accepts an image URL, a local path, or a PIL image)
outputs = image_classifier(url, candidate_labels=candidate_labels)
print(outputs)
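
A note on reading the scores: to our understanding, for SigLIP-family checkpoints each candidate label is scored independently with a sigmoid over its image-text logit, rather than with a softmax across labels, so the scores need not sum to 1. The following is a minimal sketch of that scoring step in plain Python, not the pipeline's actual implementation. The function names, the toy 3-dimensional embeddings, and the `logit_scale` and `logit_bias` values are illustrative assumptions; a real checkpoint supplies its own learned values.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def score_labels(image_emb, text_embs, labels, logit_scale, logit_bias):
    """Score each label independently: sigmoid(scale * cosine + bias)."""
    img = l2_normalize(image_emb)
    results = []
    for label, emb in zip(labels, text_embs):
        txt = l2_normalize(emb)
        cosine = sum(a * b for a, b in zip(img, txt))
        results.append({"label": label, "score": sigmoid(logit_scale * cosine + logit_bias)})
    # Highest score first, mirroring the list-of-dicts shape the pipeline returns.
    return sorted(results, key=lambda r: r["score"], reverse=True)


# Toy embeddings (assumed values, for illustration only).
image_emb = [0.9, 0.1, 0.0]
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
labels = ["2 cats", "a plane", "a remote"]

ranked = score_labels(image_emb, text_embs, labels, logit_scale=10.0, logit_bias=-5.0)
for row in ranked:
    print(row)
```

Because each label is scored on its own, it is possible for several labels to score high at once, or for none to, which suits multi-label or "none of the above" situations.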

Advanced Usage

import torch
from transformers import AutoModel, AutoProcessor
from transformers.image_utils import load_image

# load the model and processor
ckpt = "google/siglip2-base-patch16-512"
model = AutoModel.from_pretrained(ckpt, device_map="auto").eval()
processor = AutoProcessor.from_pretrained(ckpt)

# load the image
image = load_image("https://huggingface.co/datasets/merve/coco/resolve/main/val2017/000000000285.jpg")
inputs = processor(images=[image], return_tensors="pt").to(model.device)

# run inference
with torch.no_grad():
    image_embeddings = model.get_image_features(**inputs)

print(image_embeddings.shape)
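
Embeddings like the ones printed above are typically used for retrieval by ranking gallery items on cosine similarity to a query embedding. The sketch below illustrates only that ranking step in plain Python so it runs without a model download. The file names, the 3-dimensional vectors, and the `retrieve` helper are hypothetical; in practice the vectors would come from `model.get_image_features` and `model.get_text_features`, and the arithmetic would be done with tensor operations.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_emb, gallery, k=2):
    """Return the k gallery entries most similar to the query, best first."""
    scored = [(name, cosine_similarity(query_emb, emb)) for name, emb in gallery.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical precomputed image embeddings (toy values).
gallery = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "plane.jpg": [0.0, 1.0, 0.2],
    "remote.jpg": [0.1, 0.0, 0.95],
}

# Hypothetical text-query embedding, e.g. for "a photo of a cat".
query = [1.0, 0.05, 0.0]

top = retrieve(query, gallery, k=2)
print(top)
```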

For more code examples, see the SigLIP documentation.

🔧 Technical Details

Training procedure

SigLIP 2 adds some clever training objectives on top of SigLIP:

Decoder loss
Global-local and masked prediction loss
Aspect ratio and resolution adaptability
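
For context, the base SigLIP objective that these additions build on is a pairwise sigmoid loss: every image-text pair in a batch is treated as an independent binary classification, positive for matching pairs and negative otherwise. The sketch below follows that formulation as described in the original SigLIP paper; it does not cover the SigLIP 2 additions listed above. It assumes the embeddings are already L2-normalized, and the `temperature` and `bias` values are toy numbers chosen for illustration, not the trained ones.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pairwise_sigmoid_loss(image_embs, text_embs, temperature, bias):
    """Sum binary log-losses over all image-text pairs, divided by batch size.

    Assumes image_embs[i] and text_embs[i] form the matching pair and that
    all embeddings are unit length.
    """
    n = len(image_embs)
    total = 0.0
    for i in range(n):
        for j in range(n):
            logit = temperature * dot(image_embs[i], text_embs[j]) + bias
            label = 1.0 if i == j else -1.0  # +1 for matching pairs, -1 otherwise
            total += log_sigmoid(label * logit)
    return -total / n


# Toy batch of two unit-length embeddings per modality (assumed values).
images = [[1.0, 0.0], [0.0, 1.0]]
texts_aligned = [[1.0, 0.0], [0.0, 1.0]]
texts_swapped = [[0.0, 1.0], [1.0, 0.0]]

aligned = pairwise_sigmoid_loss(images, texts_aligned, temperature=10.0, bias=-5.0)
swapped = pairwise_sigmoid_loss(images, texts_swapped, temperature=10.0, bias=-5.0)
print(aligned, swapped)
```

The loss is small when matching pairs are close and non-matching pairs are far apart, and large when the pairing is swapped.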

Training data

SigLIP 2 is pre-trained on the WebLI dataset (Chen et al., 2023).

Compute

The model was trained on up to 2048 TPU-v5e chips.

📚 Documentation

Evaluation of SigLIP 2 is shown below (taken from the paper).

Evaluation Table

BibTeX entry and citation info

@misc{tschannen2025siglip2multilingualvisionlanguage,
      title={SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features}, 
      author={Michael Tschannen and Alexey Gritsenko and Xiao Wang and Muhammad Ferjad Naeem and Ibrahim Alabdulmohsin and Nikhil Parthasarathy and Talfan Evans and Lucas Beyer and Ye Xia and Basil Mustafa and Olivier Hénaff and Jeremiah Harmsen and Andreas Steiner and Xiaohua Zhai},
      year={2025},
      eprint={2502.14786},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2502.14786}, 
}

📄 License

This project is licensed under the Apache 2.0 license.
