ViT-SO400M-14-SigLIP2 Open-source Vision-language Model - Free for Zero-shot Image Classification Tasks

ViT-SO400M-14-SigLIP2

Published by timm (converted from the original Big Vision checkpoints)

A SigLIP 2 vision-language model trained on the WebLI dataset, suitable for zero-shot image classification tasks.

Zero-shot Image Classification

Safetensors

Open Source License: Apache-2.0 #Zero-shot Image Classification #Multimodal Contrastive Learning #WebLI Pretraining

Downloads 1,178

Release Time: 2/21/2025

Model Overview

This is a contrastive image-text model designed primarily for zero-shot image classification. It uses the SigLIP 2 architecture and was trained on the WebLI dataset. According to the SigLIP 2 paper, the architecture improves semantic understanding, localization, and dense features over the original SigLIP.

Model Features

Enhanced Semantic Understanding

Built on the SigLIP 2 architecture, which the SigLIP 2 paper reports as improving semantic understanding over the original SigLIP.

Zero-shot Classification Capability

Classifies images into categories it was not explicitly trained on, using only text labels supplied at inference time.

Dense Feature Extraction

Can extract dense features from images, supporting finer-grained image understanding

Multilingual Support

Accepts text input in multiple languages (inferred from the SigLIP 2 paper, which describes the encoders as multilingual).

Model Capabilities

Zero-shot Image Classification

Image-Text Matching

Multimodal Feature Extraction

Cross-modal Retrieval
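
Image-text matching and cross-modal retrieval both come down to ranking candidates by the similarity of their embeddings to a query embedding. The sketch below illustrates that ranking step with small hand-written vectors in place of real model outputs. The helper `rank_by_similarity` and the toy embeddings are hypothetical and used for illustration only; they are not part of OpenCLIP.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def rank_by_similarity(query, candidates):
    """Return (name, cosine similarity) pairs sorted from best to worst match."""
    q = l2_normalize(query)
    scored = []
    for name, emb in candidates.items():
        c = l2_normalize(emb)
        scored.append((name, sum(a * b for a, b in zip(q, c))))
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Toy 3-d embeddings standing in for real text/image features (assumed values).
text_query = [0.9, 0.1, 0.0]
image_bank = {
    "photo_a": [1.0, 0.0, 0.0],
    "photo_b": [0.0, 1.0, 0.0],
    "photo_c": [0.5, 0.5, 0.0],
}

ranking = rank_by_similarity(text_query, image_bank)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

With a real model, the query and candidate vectors would come from the text and image encoders; the ranking logic stays the same.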

Use Cases

Image Classification

Zero-shot Object Recognition

Recognizes objects from new categories without additional training

Example: identifies the beignet in the sample image from a set of candidate labels

Content Understanding

Image Semantic Understanding

Understands image content and matches relevant text descriptions

🚀 ViT-SO400M-14-SigLIP2 Model Card

This is a SigLIP 2 vision-language model trained on WebLI and intended for zero-shot image classification. It has been converted for use in OpenCLIP from the original JAX checkpoints in Big Vision.

✨ Features

Contrastive Image-Text: Capable of learning joint representations of images and text through contrastive learning.
Zero-Shot Image Classification: Can classify images into categories without explicit training on those specific categories.
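
To make the contrastive objective concrete, here is a minimal sketch of a pairwise sigmoid loss in plain Python, following the pseudocode in the cited "Sigmoid loss for language image pre-training" paper: every image-text pair in a batch is scored independently, with matching pairs labelled +1 and all other pairs labelled -1. The function names, toy embeddings, and the `scale` and `bias` values are assumptions chosen for illustration; this is not the training code used for this checkpoint.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def sigmoid_pairwise_loss(img_embs, txt_embs, scale, bias):
    """Average over images of the summed per-pair sigmoid losses."""
    n = len(img_embs)
    total = 0.0
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(img_embs[i], txt_embs[j]))
            logit = scale * dot + bias
            label = 1.0 if i == j else -1.0  # +1 for matching pairs, -1 otherwise
            total += log_sigmoid(label * logit)
    return -total / n

# Toy unit-length embeddings (assumed values).
images = [[1.0, 0.0], [0.0, 1.0]]
texts_aligned = [[1.0, 0.0], [0.0, 1.0]]   # text i matches image i
texts_swapped = [[0.0, 1.0], [1.0, 0.0]]   # text i matches the other image

loss_aligned = sigmoid_pairwise_loss(images, texts_aligned, scale=10.0, bias=-5.0)
loss_swapped = sigmoid_pairwise_loss(images, texts_swapped, scale=10.0, bias=-5.0)
print(f"aligned: {loss_aligned:.4f}, swapped: {loss_swapped:.4f}")
```

The loss is low when matching pairs have high similarity and mismatched pairs have low similarity, and high when the pairing is wrong.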

📦 Installation

The code example in this README requires open-clip-torch >= 2.31.0 and timm >= 1.0.15. You can install both with the following command (the quotes keep the shell from treating >= as an output redirect):

pip install "open-clip-torch>=2.31.0" "timm>=1.0.15"
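
To confirm that an environment meets these minimums before running the example, a small check along the following lines may help. It is a sketch using only the standard library; `parse_release` and `meets_minimum` are hypothetical helpers, and the parsing is deliberately simplified (it compares only the numeric release segments and ignores pre-release suffixes).

```python
from importlib.metadata import PackageNotFoundError, version

# Minimum versions stated in this README.
MINIMUMS = {"open-clip-torch": "2.31.0", "timm": "1.0.15"}

def parse_release(text):
    """Turn '2.31.0' into (2, 31, 0), stopping at the first non-numeric piece."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)

def meets_minimum(installed, minimum):
    """Compare numeric release tuples, padding the shorter one with zeros."""
    a, b = parse_release(installed), parse_release(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b

for package, minimum in MINIMUMS.items():
    try:
        installed = version(package)
    except PackageNotFoundError:
        print(f"{package}: not installed (need >= {minimum})")
        continue
    if meets_minimum(installed, minimum):
        print(f"{package} {installed}: ok")
    else:
        print(f"{package} {installed}: too old (need >= {minimum})")
```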

💻 Usage Examples

Basic Usage

import torch
import torch.nn.functional as F
from urllib.request import urlopen
from PIL import Image
from open_clip import create_model_from_pretrained, get_tokenizer # works on open-clip-torch >= 2.31.0, timm >= 1.0.15

model, preprocess = create_model_from_pretrained('hf-hub:timm/ViT-SO400M-14-SigLIP2')
tokenizer = get_tokenizer('hf-hub:timm/ViT-SO400M-14-SigLIP2')

image = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))
image = preprocess(image).unsqueeze(0)

labels_list = ["a dog", "a cat", "a donut", "a beignet"]
text = tokenizer(labels_list, context_length=model.context_length)

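# torch.cuda.amp.autocast() is deprecated in recent PyTorch releases; torch.autocast is the current form
with torch.no_grad(), torch.autocast(device_type="cuda", enabled=torch.cuda.is_available()):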
    image_features = model.encode_image(image, normalize=True)
    text_features = model.encode_text(text, normalize=True)
    text_probs = torch.sigmoid(image_features @ text_features.T * model.logit_scale.exp() + model.logit_bias)

# Scale to percent before rounding to avoid float artifacts such as 10.000000000000002
zipped_list = list(zip(labels_list, [round(100 * p.item(), 1) for p in text_probs[0]]))
print("Label probabilities: ", zipped_list)
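
Note that the usage example above applies a sigmoid to each scaled and biased similarity rather than a softmax across labels, so each label receives an independent probability and the values need not sum to 100%. The sketch below contrasts the two on made-up logits in plain Python; the logit values are hypothetical and are not real outputs of this model.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(xs):
    """Softmax with the max subtracted for numerical stability."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["a dog", "a cat", "a donut", "a beignet"]
logits = [-9.0, -11.0, -2.0, 3.0]  # hypothetical scaled-and-biased similarities

sigmoid_probs = [sigmoid(z) for z in logits]   # independent per-label scores
softmax_probs = softmax(logits)                # scores that compete across labels

for label, s, m in zip(labels, sigmoid_probs, softmax_probs):
    print(f"{label:>10}: sigmoid={s:.3f}  softmax={m:.3f}")
print(f"sum sigmoid={sum(sigmoid_probs):.3f}  sum softmax={sum(softmax_probs):.3f}")
```

One practical consequence is that, with sigmoid scoring, all labels can receive low probabilities when none of them describes the image, whereas a softmax always distributes a total of 1 across the given labels.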

📚 Documentation

Model Details

Property	Details
Model Type	Contrastive Image-Text, Zero-Shot Image Classification
Original Source	https://github.com/google-research/big_vision
Dataset	WebLI
Papers	SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features (https://arxiv.org/abs/2502.14786)
	Sigmoid loss for language image pre-training (https://arxiv.org/abs/2303.15343)

📄 License

This model is licensed under the Apache-2.0 license.

📖 Citation

@article{tschannen2025siglip,
  title={SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features},
  author={Tschannen, Michael and Gritsenko, Alexey and Wang, Xiao and Naeem, Muhammad Ferjad and Alabdulmohsin, Ibrahim and Parthasarathy, Nikhil and Evans, Talfan and Beyer, Lucas and Xia, Ye and Mustafa, Basil and H{\'e}naff, Olivier and Harmsen, Jeremiah and Steiner, Andreas and Zhai, Xiaohua},
  year={2025},
  journal={arXiv preprint arXiv:2502.14786}
}

@article{zhai2023sigmoid,
  title={Sigmoid loss for language image pre-training},
  author={Zhai, Xiaohua and Mustafa, Basil and Kolesnikov, Alexander and Beyer, Lucas},
  journal={arXiv preprint arXiv:2303.15343},
  year={2023}
}

@misc{big_vision,
  author = {Beyer, Lucas and Zhai, Xiaohua and Kolesnikov, Alexander},
  title = {Big Vision},
  year = {2022},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/google-research/big_vision}}
}
