# ViT-B-16-SigLIP Model Card
A SigLIP (Sigmoid loss for Language-Image Pre-training) model trained on WebLI, which can be used for zero-shot image classification.
This model has been converted to PyTorch from the original JAX checkpoints in Big Vision. These weights are usable in both OpenCLIP (image + text) and timm (image only).
## Features
- This is a contrastive image-text model capable of zero-shot image classification.
- Converted from JAX checkpoints in Big Vision to PyTorch, making it compatible with both OpenCLIP and timm.
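As background for the `logit_scale` and `logit_bias` parameters used in the examples below, the sigmoid loss scores every image-text pair in a batch independently rather than normalizing over the batch with a softmax. A paraphrase of the objective from the paper (notation may differ slightly from the original):

$$
\mathcal{L} = -\frac{1}{|\mathcal{B}|} \sum_{i=1}^{|\mathcal{B}|} \sum_{j=1}^{|\mathcal{B}|} \log \frac{1}{1 + e^{z_{ij}(-t \, \mathbf{x}_i \cdot \mathbf{y}_j - b)}}
$$

where \\(z_{ij}\\) is 1 for matched image-text pairs and -1 otherwise, \\(\mathbf{x}_i\\) and \\(\mathbf{y}_j\\) are normalized image and text embeddings, \\(t\\) is the learned temperature (`logit_scale`), and \\(b\\) is the learned bias (`logit_bias`).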
## Installation

The examples below rely on the `open_clip_torch` and `timm` packages, which can be installed with `pip install open_clip_torch timm`.
## Usage Examples
### Basic Usage

Zero-shot image classification with OpenCLIP:
```python
import torch
import torch.nn.functional as F
from urllib.request import urlopen
from PIL import Image
from open_clip import create_model_from_pretrained, get_tokenizer

model, preprocess = create_model_from_pretrained('hf-hub:timm/ViT-B-16-SigLIP')
tokenizer = get_tokenizer('hf-hub:timm/ViT-B-16-SigLIP')

image = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))
image = preprocess(image).unsqueeze(0)  # add batch dimension

labels_list = ["a dog", "a cat", "a donut", "a beignet"]
text = tokenizer(labels_list, context_length=model.context_length)

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # SigLIP applies a sigmoid (not a softmax) to the scaled, biased similarities,
    # so each label probability is independent of the others.
    text_probs = torch.sigmoid(image_features @ text_features.T * model.logit_scale.exp() + model.logit_bias)

zipped_list = list(zip(labels_list, [round(p.item(), 3) for p in text_probs[0]]))
print("Label probabilities: ", zipped_list)
```
### Advanced Usage

Extracting image embeddings with timm (image encoder only):
```python
from urllib.request import urlopen
from PIL import Image
import timm

image = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model(
    'vit_base_patch16_siglip_224',
    pretrained=True,
    num_classes=0,  # remove the classifier head to get pooled embeddings
)
model = model.eval()

# Resolve the model-specific preprocessing (resize, crop, normalization).
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(image).unsqueeze(0))  # (1, num_features) image embedding
```
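If you need token-level features rather than a pooled embedding, timm models also expose `forward_features`. A minimal sketch continuing the example above; the shape shown in the comment is an expectation based on 16x16 patches at 224x224 resolution, not a quoted output:

```python
# Continuing from the timm example above: extract unpooled patch tokens.
tokens = model.forward_features(transforms(image).unsqueeze(0))
print(tokens.shape)  # expected (1, 196, 768): 14x14 patches, 768-dim, no class token
```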
## Documentation

### Model Details

- **Model Type:** Contrastive image-text model for zero-shot image classification
- **Training Data:** WebLI
- **Original Implementation:** JAX checkpoints in Big Vision (https://github.com/google-research/big_vision)
- **Paper:** Sigmoid Loss for Language Image Pre-Training (https://arxiv.org/abs/2303.15343)
## License
This model is licensed under the Apache-2.0 license.
## Citation
```bibtex
@article{zhai2023sigmoid,
  title={Sigmoid loss for language image pre-training},
  author={Zhai, Xiaohua and Mustafa, Basil and Kolesnikov, Alexander and Beyer, Lucas},
  journal={arXiv preprint arXiv:2303.15343},
  year={2023}
}
```

```bibtex
@misc{big_vision,
  author = {Beyer, Lucas and Zhai, Xiaohua and Kolesnikov, Alexander},
  title = {Big Vision},
  year = {2022},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/google-research/big_vision}}
}
```