
ViT-B-16-SigLIP-512

Developed by timm
SigLIP (Sigmoid Loss for Language-Image Pre-training) model trained on the WebLI dataset for zero-shot image classification tasks
Downloads 3,787
Release Time : 10/16/2023

Model Overview

This is a contrastive image-text model that uses a sigmoid loss for language-image pretraining, making it particularly suitable for zero-shot image classification. The model was converted from the original JAX checkpoint to PyTorch format and can be used with both OpenCLIP and timm.

Model Features

Sigmoid Loss Function
Uses a pairwise sigmoid loss instead of the traditional softmax contrastive loss for language-image pretraining. Each image-text pair is scored as an independent binary classification problem, which removes the need for a global normalization over the batch and scales better to large batch sizes.
Zero-shot Classification Capability
Can be directly applied to new image classification tasks without task-specific fine-tuning
Multi-framework Support
Supports both OpenCLIP (image + text) and timm (image only) frameworks
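The pairwise sigmoid objective behind these features can be sketched in a few lines of NumPy. This is a minimal illustration, not the actual training code: in the real model the scale t and bias b are learned parameters, and the embeddings come from the vision and text towers; here they are fixed values and random vectors.

```python
import numpy as np

def siglip_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Pairwise sigmoid loss over a batch of L2-normalized embeddings.

    Unlike a softmax contrastive loss, every image-text pair is an
    independent binary classification: label +1 for matching pairs
    (the diagonal), -1 for all other pairs in the batch.
    """
    logits = t * img_emb @ txt_emb.T + b          # (n, n) pair logits
    labels = 2.0 * np.eye(len(img_emb)) - 1.0     # +1 diag, -1 off-diag
    # -log sigmoid(label * logit), computed stably via logaddexp
    return np.mean(np.logaddexp(0.0, -labels * logits))

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 16))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

aligned = siglip_loss(emb, emb)                  # matched pairs
shuffled = siglip_loss(emb, np.roll(emb, 1, 0))  # deliberately mismatched
```

Because each pair contributes an independent term, the loss decomposes over the batch; deliberately mismatching the embeddings, as in the last line, raises the loss.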

Model Capabilities

Zero-shot Image Classification
Image Feature Extraction
Text Feature Extraction
Image-Text Matching

Use Cases

Image Recognition
Food Recognition
Identify food categories in images, such as donuts, beignets, etc.
Can output probability distributions for each category
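Because SigLIP scores each candidate label independently with a sigmoid rather than a softmax, these per-category probabilities need not sum to 1. A small sketch of the scoring step (the similarity values and the scale/bias are made up for illustration; in the real model they come from the encoders and learned parameters):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

labels = ["a photo of a donut", "a photo of a beignet", "a photo of a dog"]
# Hypothetical cosine similarities between one image embedding and
# each text embedding (values invented for this example).
sims = np.array([0.95, 0.90, 0.05])
t, b = 10.0, -10.0               # stand-ins for the learned scale and bias
probs = sigmoid(t * sims + b)    # one independent probability per label

for name, p in zip(labels, probs):
    print(f"{name}: {p:.3f}")
```

Unlike softmax scores, these probabilities are independent, so an image can plausibly score high for several labels at once, or low for all of them.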
Content Moderation
Inappropriate Content Detection
Detect whether an image contains specific categories of inappropriate content