# ViT-B-16-SigLIP-i18n-256 Model Card
A SigLIP (Sigmoid loss for Language-Image Pre-training) model trained on WebLI, enabling zero-shot image classification.
## 🚀 Quick Start
This SigLIP model is trained on WebLI and has been converted from the original JAX checkpoints in Big Vision to PyTorch. The weights can be used in both OpenCLIP (image + text) and timm (image only).
## ✨ Features
- Contrastive Image-Text: Capable of handling both image and text data for contrastive learning.
- Zero-Shot Image Classification: Can classify images without fine-tuning on task-specific datasets.
## 📦 Installation
The usage examples below rely on the `open_clip_torch` and `timm` packages from PyPI; see the install command below.
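A minimal install, assuming a working Python environment (PyTorch is pulled in as a dependency if not already present):

```bash
pip install open_clip_torch timm
```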
## 💻 Usage Examples
### Basic Usage

#### With OpenCLIP
```python
import torch
import torch.nn.functional as F
from urllib.request import urlopen
from PIL import Image
from open_clip import create_model_from_pretrained, get_tokenizer

# Load the model, its preprocessing transform, and the matching tokenizer from the HF Hub.
model, preprocess = create_model_from_pretrained('hf-hub:timm/ViT-B-16-SigLIP-i18n-256')
tokenizer = get_tokenizer('hf-hub:timm/ViT-B-16-SigLIP-i18n-256')

image = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))
image = preprocess(image).unsqueeze(0)

labels_list = ["a dog", "a cat", "a donut", "a beignet"]
text = tokenizer(labels_list, context_length=model.context_length)

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Per-pair sigmoid scores using the model's learned logit scale and bias.
    text_probs = torch.sigmoid(image_features @ text_features.T * model.logit_scale.exp() + model.logit_bias)

zipped_list = list(zip(labels_list, [round(p.item(), 3) for p in text_probs[0]]))
print("Label probabilities: ", zipped_list)
```
#### With timm (for image embeddings)
```python
from urllib.request import urlopen
from PIL import Image
import timm

image = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model(
    'vit_base_patch16_siglip_256',
    pretrained=True,
    num_classes=0,  # remove the classifier head to get pooled image embeddings
)
model = model.eval()

# Build the preprocessing transform that matches the model's pretrained config.
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(image).unsqueeze(0))  # (1, num_features) pooled embedding
```
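If you need token-level features instead of the pooled embedding, timm's Vision Transformers also expose `forward_features` and `forward_head`; a minimal sketch, continuing from the block above:

```python
# Unpooled output: patch tokens (plus any prefix tokens), shape (1, num_tokens, embed_dim).
features = model.forward_features(transforms(image).unsqueeze(0))

# Pooled embedding without a classifier head, equivalent to the num_classes=0 forward above.
embedding = model.forward_head(features, pre_logits=True)
```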
## 📚 Documentation

### Model Details
- Model Type: Contrastive image-text, zero-shot image classification
- Original: JAX checkpoints from Big Vision (https://github.com/google-research/big_vision)
- Dataset: WebLI
- Paper: Sigmoid loss for language image pre-training (https://arxiv.org/abs/2303.15343)
## 📄 License

This model is licensed under the Apache-2.0 license.
## 🔧 Technical Details

This model is a SigLIP model trained on the WebLI dataset, using a sigmoid loss for language-image pre-training. It was converted from the original JAX checkpoints in Big Vision to PyTorch, and the weights are compatible with both OpenCLIP and timm.
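For intuition, here is a minimal sketch of the pairwise sigmoid loss, following the pseudocode in the SigLIP paper: every image-text pair in a batch gets an independent binary label, positive on the diagonal and negative elsewhere. The function below is illustrative only and assumes L2-normalized features and learned scalar `logit_scale`/`logit_bias` parameters.

```python
import torch
import torch.nn.functional as F

def siglip_loss(image_features, text_features, logit_scale, logit_bias):
    """Pairwise sigmoid loss, a sketch after the SigLIP paper's pseudocode.

    Assumes image_features and text_features are L2-normalized, shape (n, d).
    """
    # Logits for every image-text pair in the batch: (n, n).
    logits = image_features @ text_features.T * logit_scale + logit_bias
    # Matching pairs (the diagonal) are positives (+1), all other pairs negatives (-1).
    labels = 2 * torch.eye(logits.size(0), device=logits.device) - 1
    # -log sigmoid(label * logit), summed over all pairs, averaged over the batch.
    return -F.logsigmoid(labels * logits).sum() / logits.size(0)
```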
## 📚 Citation
```bibtex
@article{zhai2023sigmoid,
  title={Sigmoid loss for language image pre-training},
  author={Zhai, Xiaohua and Mustafa, Basil and Kolesnikov, Alexander and Beyer, Lucas},
  journal={arXiv preprint arXiv:2303.15343},
  year={2023}
}
```

```bibtex
@misc{big_vision,
  author = {Beyer, Lucas and Zhai, Xiaohua and Kolesnikov, Alexander},
  title = {Big Vision},
  year = {2022},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/google-research/big_vision}}
}
```