
ViT-B-16-SigLIP-384

Developed by timm
SigLIP (Sigmoid Loss for Language-Image Pre-training) model trained on the WebLI dataset for zero-shot image classification tasks
Downloads 4,119
Release Time: 10/16/2023

Model Overview

This is a contrastive image-text model pretrained with a sigmoid loss and suited to zero-shot image classification. It uses a ViT-B/16 image encoder at 384×384 input resolution and was trained on the WebLI dataset.
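A minimal zero-shot classification sketch with OpenCLIP is shown below. It assumes the open_clip_torch package is installed and that the weights are available under the hub id timm/ViT-B-16-SigLIP-384; the image path and label list are placeholders for illustration.

```python
import torch
import torch.nn.functional as F
from PIL import Image
import open_clip

# Load the SigLIP model and its preprocessing transform from the Hugging Face Hub
model, preprocess = open_clip.create_model_from_pretrained('hf-hub:timm/ViT-B-16-SigLIP-384')
tokenizer = open_clip.get_tokenizer('hf-hub:timm/ViT-B-16-SigLIP-384')
model.eval()

# 'example.jpg' and the labels below are placeholders
image = preprocess(Image.open('example.jpg')).unsqueeze(0)
labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
text = tokenizer(labels, context_length=model.context_length)

with torch.no_grad():
    img_emb = F.normalize(model.encode_image(image), dim=-1)
    txt_emb = F.normalize(model.encode_text(text), dim=-1)
    # SigLIP scores each image-text pair independently with a sigmoid,
    # so probabilities do not have to sum to 1 across labels
    probs = torch.sigmoid(img_emb @ txt_emb.T * model.logit_scale.exp() + model.logit_bias)

for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```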

Model Features

Sigmoid loss function
Uses a sigmoid loss for language-image pretraining, which is reported to outperform the conventional softmax contrastive loss (a minimal sketch follows this list)
Zero-shot learning capability
Can classify images into new categories without requiring specific category training
High-resolution input
Supports high-resolution image input at 384x384 pixels
Multi-framework support
Supports both the OpenCLIP (image + text) and timm (image-only) frameworks
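To make the sigmoid-loss feature concrete, here is a minimal PyTorch sketch of the pairwise sigmoid loss described in the SigLIP paper. The function name is illustrative, and the scalars t (logit scale) and b (logit bias) are learned parameters in actual training.

```python
import torch
import torch.nn.functional as F

def sigmoid_pairwise_loss(img_emb, txt_emb, t, b):
    """Pairwise sigmoid image-text loss (sketch).

    img_emb, txt_emb: (N, D) L2-normalized embeddings for N matched pairs.
    t, b: learned logit scale (temperature) and bias scalars.
    """
    n = img_emb.size(0)
    logits = img_emb @ txt_emb.T * t + b  # (N, N) pairwise similarities
    # +1 on the diagonal (matching pairs), -1 everywhere else (non-matching pairs)
    signs = 2 * torch.eye(n, device=logits.device) - 1
    # Each image-text pair is treated as an independent binary classification problem
    return -F.logsigmoid(signs * logits).sum() / n
```

Because every pair is scored independently, no softmax normalization over the batch is required, which is the main practical difference from the standard contrastive loss.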

Model Capabilities

Zero-shot image classification
Image-text matching
Image feature extraction (see the timm sketch after this list)
Multimodal understanding
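For image-only feature extraction, the image tower can be loaded through timm. The sketch below assumes the timm model name vit_base_patch16_siglip_384 and uses a placeholder image path.

```python
import timm
import torch
from PIL import Image

# Image tower only (no text encoder); num_classes=0 returns the pooled embedding
model = timm.create_model('vit_base_patch16_siglip_384', pretrained=True, num_classes=0)
model.eval()

# Build the matching preprocessing (resize to 384x384, normalize) from the model's config
cfg = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**cfg, is_training=False)

img = Image.open('example.jpg').convert('RGB')  # placeholder path
with torch.no_grad():
    features = model(transform(img).unsqueeze(0))  # shape (1, 768) for the ViT-B/16 tower
print(features.shape)
```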

Use Cases

Content classification
Social media image classification
Automatically classify and tag images on social media
Can accurately identify objects, scenes, and activities in images
E-commerce
Product image classification
Automatically classify product images on e-commerce platforms
No need to train separate models for each product category
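A hedged sketch of that workflow, again assuming the open_clip_torch package and the hub id timm/ViT-B-16-SigLIP-384; the category names and file path are placeholders. The category text embeddings are computed once and reused for every incoming image, so adding a new product category only means adding a new prompt.

```python
import torch
import torch.nn.functional as F
from PIL import Image
import open_clip

model, preprocess = open_clip.create_model_from_pretrained('hf-hub:timm/ViT-B-16-SigLIP-384')
tokenizer = open_clip.get_tokenizer('hf-hub:timm/ViT-B-16-SigLIP-384')
model.eval()

# Illustrative product taxonomy; no per-category training is needed
categories = ["running shoes", "wireless headphones", "coffee maker", "backpack"]
prompts = [f"a product photo of a {c}." for c in categories]

with torch.no_grad():
    # Encode the category prompts once and reuse them for every image
    cat_emb = F.normalize(
        model.encode_text(tokenizer(prompts, context_length=model.context_length)), dim=-1)

def classify_product(path):
    """Return the best-matching category for one product image."""
    with torch.no_grad():
        img_emb = F.normalize(
            model.encode_image(preprocess(Image.open(path).convert('RGB')).unsqueeze(0)), dim=-1)
        scores = torch.sigmoid(img_emb @ cat_emb.T * model.logit_scale.exp() + model.logit_bias)
    return categories[scores.argmax().item()]

print(classify_product('product.jpg'))  # placeholder path
```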