# ViT-B-16-SigLIP2 Model Card
This is a SigLIP 2 Vision-Language model trained on WebLI, which has been converted for use in OpenCLIP from the original JAX checkpoints in Big Vision. It is designed for zero-shot image classification tasks.
## Quick Start
The installation steps and code example below will get you started with the ViT-B-16-SigLIP2 model.
## Features
- Model Type: Contrastive Image-Text, Zero-Shot Image Classification
- Original: https://github.com/google-research/big_vision
- Dataset: WebLI
- Papers:
  - SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features: https://arxiv.org/abs/2502.14786
  - Sigmoid Loss for Language Image Pre-Training: https://arxiv.org/abs/2303.15343
## Installation
To use this model, make sure `open-clip-torch >= 2.31.0` and `timm >= 1.0.15` are installed.
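Both packages are available on PyPI, for example:

```bash
pip install 'open-clip-torch>=2.31.0' 'timm>=1.0.15'
```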
## Usage Examples
### Basic Usage
```python
import torch
from urllib.request import urlopen
from PIL import Image
from open_clip import create_model_from_pretrained, get_tokenizer

# Load the model with its preprocessing transform, plus the matching tokenizer
model, preprocess = create_model_from_pretrained('hf-hub:timm/ViT-B-16-SigLIP2')
tokenizer = get_tokenizer('hf-hub:timm/ViT-B-16-SigLIP2')

image = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))
image = preprocess(image).unsqueeze(0)

labels_list = ["a dog", "a cat", "a donut", "a beignet"]
text = tokenizer(labels_list, context_length=model.context_length)

with torch.no_grad(), torch.cuda.amp.autocast():
    # L2-normalized image and text embeddings
    image_features = model.encode_image(image, normalize=True)
    text_features = model.encode_text(text, normalize=True)
    # SigLIP scores each image-text pair independently with a sigmoid
    text_probs = torch.sigmoid(image_features @ text_features.T * model.logit_scale.exp() + model.logit_bias)

# Pair each label with its probability, expressed as a percentage
zipped_list = list(zip(labels_list, [100 * round(p.item(), 3) for p in text_probs[0]]))
print("Label probabilities: ", zipped_list)
```
## Documentation
The model details and usage are described above. For more information, please refer to the original papers and the source code repository.
## License
This model is released under the Apache-2.0 license.
## Citation
```bibtex
@article{tschannen2025siglip,
  title={SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features},
  author={Tschannen, Michael and Gritsenko, Alexey and Wang, Xiao and Naeem, Muhammad Ferjad and Alabdulmohsin, Ibrahim and Parthasarathy, Nikhil and Evans, Talfan and Beyer, Lucas and Xia, Ye and Mustafa, Basil and H{\'e}naff, Olivier and Harmsen, Jeremiah and Steiner, Andreas and Zhai, Xiaohua},
  journal={arXiv preprint arXiv:2502.14786},
  year={2025}
}

@article{zhai2023sigmoid,
  title={Sigmoid Loss for Language Image Pre-Training},
  author={Zhai, Xiaohua and Mustafa, Basil and Kolesnikov, Alexander and Beyer, Lucas},
  journal={arXiv preprint arXiv:2303.15343},
  year={2023}
}

@misc{big_vision,
  author={Beyer, Lucas and Zhai, Xiaohua and Kolesnikov, Alexander},
  title={Big Vision},
  year={2022},
  publisher={GitHub},
  journal={GitHub repository},
  howpublished={\url{https://github.com/google-research/big_vision}}
}
```