ViT-L-16-SigLIP-256: Open-Source Model for Zero-Shot Image Classification

Developed by timm

A SigLIP (Sigmoid Loss for Language-Image Pre-training) model trained on the WebLI dataset for zero-shot image classification.

Zero-Shot Image Classification

Safetensors

Open Source License: Apache-2.0 #Zero-shot classification #Sigmoid loss #Multilingual support

Downloads 1,516

Release Date: 10/16/2023

Model Overview

This is a contrastive image-text model pre-trained with a sigmoid loss. It supports zero-shot image classification.

Model Features

Sigmoid loss function

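Pre-trains with a pairwise sigmoid loss instead of the softmax-based contrastive loss used by CLIP. Each image-text pair is scored independently as a match or non-match, so the loss needs no normalization across the whole batch.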
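To make the loss concrete, here is a minimal pure-Python sketch of a pairwise sigmoid loss in the form described in the SigLIP paper: every image-text pair in a batch gets a label of +1 if it matches and -1 otherwise, and the loss is the negative log-sigmoid of label times logit, summed over all pairs and divided by the batch size. The function name `sigmoid_pairwise_loss`, the toy embeddings, and the `scale` and `bias` values are assumptions chosen for illustration; this is not the Big Vision training code.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def sigmoid_pairwise_loss(image_emb, text_emb, scale, bias):
    """Pairwise sigmoid loss over a batch.

    image_emb, text_emb: lists of L2-normalized vectors; index i in both
    lists is assumed to be a matching image-text pair.
    """
    n = len(image_emb)
    total = 0.0
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(image_emb[i], text_emb[j]))
            logit = scale * dot + bias
            label = 1.0 if i == j else -1.0  # +1 for matches, -1 otherwise
            total += log_sigmoid(label * logit)
    return -total / n


# Toy 2-D embeddings (hypothetical values, already unit length).
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_good = sigmoid_pairwise_loss(aligned, aligned, scale=10.0, bias=-5.0)
loss_bad = sigmoid_pairwise_loss(aligned, swapped, scale=10.0, bias=-5.0)
print(loss_good < loss_bad)  # matching pairs give the lower loss
```

Because each pair contributes its own term, the loss for one pair does not depend on the scores of the other pairs in the batch.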

Zero-shot classification

Classifies images into categories supplied as text labels at inference time, with no task-specific fine-tuning.

Multi-framework support

The weights load in OpenCLIP (image and text encoders) and in timm (image encoder only).

Model Capabilities

Image feature extraction

Text feature extraction

Zero-shot image classification

Image-text contrastive learning

Use Cases

Image classification

Zero-shot image classification

Classify new image categories without fine-tuning.

Image retrieval

Text-based image retrieval

Retrieve relevant images based on text descriptions.
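As a rough illustration of text-based retrieval, the sketch below ranks a small gallery of image embeddings by cosine similarity to a text embedding. The helper names (`l2_normalize`, `rank_images`), the file names, and the 3-dimensional vectors are all hypothetical stand-ins; in practice the embeddings would come from the model's image and text encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def rank_images(text_embedding, image_embeddings):
    """Return (image_id, cosine_similarity) pairs, best match first."""
    query = l2_normalize(text_embedding)
    scored = []
    for image_id, emb in image_embeddings.items():
        unit = l2_normalize(emb)
        scored.append((image_id, sum(q * e for q, e in zip(query, unit))))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real embeddings (made-up values).
gallery = {
    "beach.jpg": [0.9, 0.1, 0.0],
    "forest.jpg": [0.1, 0.9, 0.2],
    "city.jpg": [0.0, 0.2, 0.9],
}
query_embedding = [0.2, 1.0, 0.1]  # pretend this encodes "a photo of a forest"

ranking = rank_images(query_embedding, gallery)
print(ranking[0][0])  # forest.jpg
```

For a large gallery, the image embeddings would typically be computed once and stored, so that only the text query needs encoding at search time.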

🚀 ViT-L-16-SigLIP-256

A SigLIP model trained on WebLI for zero-shot image classification

This SigLIP (Sigmoid loss for Language-Image Pre-training) model is trained on WebLI. It has been converted from the original JAX checkpoints in Big Vision to PyTorch. These weights can be used in both OpenCLIP (for both image and text) and timm (for image only).

🚀 Quick Start

The model can be loaded through either OpenCLIP or timm, as shown in the usage examples below.

✨ Features

Trained on WebLI dataset.
Converted from JAX to PyTorch.
Usable in both OpenCLIP and timm.

📦 Installation

Installation depends on the library you plan to use. For OpenCLIP, install open-clip-torch>=2.23.0 and timm>=0.9.8. For timm alone, install the latest version.
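One way to confirm that an environment meets these minimums is to query the installed package metadata from Python. The sketch below uses only the standard library; `parse_version`, `meets_minimum`, and `check_requirements` are hypothetical helpers written for this example, and the simplified parser only compares the leading numeric parts of a version string (pre-release suffixes are ignored).

```python
from importlib.metadata import PackageNotFoundError, version

# Minimum versions stated above for the OpenCLIP usage path.
REQUIREMENTS = {"open-clip-torch": "2.23.0", "timm": "0.9.8"}


def parse_version(text):
    """Turn '2.23.0' into (2, 23, 0); stops at the first non-numeric piece."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare two version strings after padding them to equal length."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_requirements(requirements=REQUIREMENTS):
    """Report 'ok', 'too old', or 'not installed' for each package."""
    report = {}
    for package, minimum in requirements.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            report[package] = "not installed"
            continue
        if meets_minimum(installed, minimum):
            report[package] = "ok"
        else:
            report[package] = f"too old ({installed} < {minimum})"
    return report


print(check_requirements())
```

The printed report depends on the local environment, so no expected output is shown here.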

💻 Usage Examples

Basic Usage

With OpenCLIP

import torch
import torch.nn.functional as F
from urllib.request import urlopen
from PIL import Image
from open_clip import create_model_from_pretrained, get_tokenizer # works on open-clip-torch>=2.23.0, timm>=0.9.8

model, preprocess = create_model_from_pretrained('hf-hub:timm/ViT-L-16-SigLIP-256')
tokenizer = get_tokenizer('hf-hub:timm/ViT-L-16-SigLIP-256')

image = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))
image = preprocess(image).unsqueeze(0)

labels_list = ["a dog", "a cat", "a donut", "a beignet"]
text = tokenizer(labels_list, context_length=model.context_length)

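# Inference only: skip gradient tracking; autocast applies mixed precision to CUDA ops
with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # L2-normalize so the dot product below is a cosine similarity
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Each label is scored independently with a sigmoid, so the probabilities need not sum to 1
    text_probs = torch.sigmoid(image_features @ text_features.T * model.logit_scale.exp() + model.logit_bias)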

zipped_list = list(zip(labels_list, [round(p.item(), 3) for p in text_probs[0]]))
print("Label probabilities: ", zipped_list)
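The probabilities printed by the OpenCLIP example come from a sigmoid applied to each label's logit separately, so they are independent scores and generally do not sum to 1, unlike softmax outputs. The small sketch below contrasts the two on made-up logits; the values are hypothetical and are not output from this model.

```python
import math


def sigmoid(x):
    """Map a single logit to an independent probability."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    """Map a list of logits to probabilities that sum to 1."""
    peak = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["a dog", "a cat", "a donut", "a beignet"]
logits = [-8.0, -9.0, -1.0, 2.0]  # made-up values for illustration

sigmoid_scores = [sigmoid(z) for z in logits]
softmax_scores = softmax(logits)

# Both scoring rules preserve the ordering of the logits,
# so the top label is the same either way.
best_index = max(range(len(labels)), key=lambda i: sigmoid_scores[i])
best_label = labels[best_index]

print(best_label)
print(round(sum(sigmoid_scores), 3), round(sum(softmax_scores), 3))
```

One practical consequence is that every label can receive a low sigmoid score when none of the candidate texts describes the image, whereas softmax would still spread a total of 1 across them.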

With `timm` (for image embeddings)

from urllib.request import urlopen
from PIL import Image
import timm

image = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model(
    'vit_large_patch16_siglip_256',
    pretrained=True,
    num_classes=0,
)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(image).unsqueeze(0))  # output is (batch_size, num_features) shaped tensor

📚 Documentation

Model Details

Property	Details
Model Type	Contrastive Image-Text, Zero-Shot Image Classification
Original	https://github.com/google-research/big_vision
Dataset	WebLI
Papers	Sigmoid loss for language image pre-training (arXiv:2303.15343)

📄 License

This model is released under the Apache-2.0 license.

📚 Citation

@article{zhai2023sigmoid,
  title={Sigmoid loss for language image pre-training},
  author={Zhai, Xiaohua and Mustafa, Basil and Kolesnikov, Alexander and Beyer, Lucas},
  journal={arXiv preprint arXiv:2303.15343},
  year={2023}
}

@misc{big_vision,
  author = {Beyer, Lucas and Zhai, Xiaohua and Kolesnikov, Alexander},
  title = {Big Vision},
  year = {2022},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/google-research/big_vision}}
}
