🚀 SigLIP 2 So400m
SigLIP 2 extends the pretraining objective of SigLIP by unifying several previously independently developed techniques into a single recipe, improving semantic understanding, localization, and dense feature extraction.
🚀 Quick Start
You can use the raw model for tasks such as zero-shot image classification and image-text retrieval, or as a vision encoder for vision-language models (VLMs) and other vision tasks.
Here is an example of using this model for zero-shot image classification:
from transformers import pipeline
from PIL import Image
import requests

ckpt = "google/siglip2-so400m-patch14-224"
image_classifier = pipeline(model=ckpt, task="zero-shot-image-classification")

# Load the input image from a URL
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
candidate_labels = ["2 cats", "a plane", "a remote"]

outputs = image_classifier(image, candidate_labels=candidate_labels)
print(outputs)
You can also use the vision tower to encode images, like this:
import torch
from transformers import AutoModel, AutoProcessor
from transformers.image_utils import load_image

ckpt = "google/siglip2-so400m-patch14-224"
model = AutoModel.from_pretrained(ckpt, device_map="auto").eval()
processor = AutoProcessor.from_pretrained(ckpt)

image = load_image("https://huggingface.co/datasets/merve/coco/resolve/main/val2017/000000000285.jpg")
inputs = processor(images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    # Pooled image embedding from the vision tower
    image_embeddings = model.get_image_features(**inputs)

print(image_embeddings.shape)
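Beyond pooled embeddings, SigLIP 2 is also designed for dense feature extraction. Below is a minimal sketch of pulling per-patch features from the vision tower's last hidden state; it assumes a fixed-resolution checkpoint like this one (not the NaFlex variants), and the shape in the final comment is an estimate based on the 224px input and 14px patch size:

import torch
from transformers import AutoModel, AutoProcessor
from transformers.image_utils import load_image

ckpt = "google/siglip2-so400m-patch14-224"
model = AutoModel.from_pretrained(ckpt, device_map="auto").eval()
processor = AutoProcessor.from_pretrained(ckpt)

image = load_image("https://huggingface.co/datasets/merve/coco/resolve/main/val2017/000000000285.jpg")
inputs = processor(images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    # Per-patch hidden states from the vision tower, without pooling
    patch_features = model.vision_model(**inputs).last_hidden_state

# Expect roughly (1, 256, 1152): (224 / 14)^2 = 256 patches (assumption for this checkpoint)
print(patch_features.shape)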
For more code examples, refer to the siglip documentation.
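The quick start also mentions image-text retrieval. The following is a minimal sketch of scoring candidate texts against an image with the full model; the padding settings follow the SigLIP convention (padding="max_length", max_length=64) and may need adjustment for your checkpoint:

import torch
from transformers import AutoModel, AutoProcessor
from transformers.image_utils import load_image

ckpt = "google/siglip2-so400m-patch14-224"
model = AutoModel.from_pretrained(ckpt, device_map="auto").eval()
processor = AutoProcessor.from_pretrained(ckpt)

image = load_image("http://images.cocodataset.org/val2017/000000039769.jpg")
texts = ["2 cats", "a plane", "a remote"]

# SigLIP-style text inputs are padded to a fixed sequence length
inputs = processor(text=texts, images=[image], padding="max_length", max_length=64, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model(**inputs)

# Pairwise image-text logits; sigmoid gives independent match probabilities
probs = torch.sigmoid(outputs.logits_per_image)
print(probs)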
🔧 Technical Details
Training Procedure
SigLIP 2 adds several clever training objectives on top of SigLIP's base sigmoid loss (sketched after this list):
- Decoder loss
- Global-local and masked prediction loss
- Aspect ratio and resolution adaptivity
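For context, these objectives extend SigLIP's pairwise sigmoid loss, which scores every image-text pair in a batch as an independent binary classification. The following is an illustrative PyTorch sketch of that base objective, not the actual training code; the function name and signature are hypothetical:

import torch
import torch.nn.functional as F

def sigmoid_contrastive_loss(img_emb, txt_emb, log_t, b):
    # img_emb, txt_emb: (N, D) L2-normalized embeddings of N matching image-text pairs
    # log_t: learnable log-temperature (scalar tensor); b: learnable bias (scalar tensor)
    logits = img_emb @ txt_emb.t() * log_t.exp() + b
    # +1 on the diagonal (matching pairs), -1 elsewhere (negatives)
    labels = 2.0 * torch.eye(logits.size(0), device=logits.device) - 1.0
    # Binary log-sigmoid loss over all N^2 pairs, averaged over the batch
    return -F.logsigmoid(labels * logits).sum() / logits.size(0)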
Training Data
SigLIP 2 is pretrained on the WebLI dataset (Chen et al., 2023).
Compute
The model was trained on up to 2048 TPU-v5e chips.
📚 Documentation
Evaluation Results
Evaluation results for SigLIP 2 are reported in the paper (the results figure is not reproduced here).

BibTeX entry and citation info
@misc{tschannen2025siglip2multilingualvisionlanguage,
      title={SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features},
      author={Michael Tschannen and Alexey Gritsenko and Xiao Wang and Muhammad Ferjad Naeem and Ibrahim Alabdulmohsin and Nikhil Parthasarathy and Talfan Evans and Lucas Beyer and Ye Xia and Basil Mustafa and Olivier Hénaff and Jeremiah Harmsen and Andreas Steiner and Xiaohua Zhai},
      year={2025},
      eprint={2502.14786},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2502.14786},
}
📄 License
This project is licensed under the Apache-2.0 license.