ViT-SO400M-14-SigLIP-384 Open-Source Model - Free for Zero-Shot Image Classification Tasks

Published by timm (converted from the original Big Vision JAX checkpoints)

SigLIP (Sigmoid Loss for Language-Image Pretraining) model trained on the WebLI dataset, suitable for zero-shot image classification tasks.

Zero-Shot Image Classification

Safetensors

Open Source License: Apache-2.0 #Zero-shot Image Classification #Sigmoid Loss Optimization #Multilingual Text Support

Downloads 158.84k

Release Time: 10/16/2023

Model Overview

This model is pretrained contrastively on image-text pairs with a sigmoid loss, which lets it perform zero-shot image classification by scoring an image against candidate text labels.

Model Features

Sigmoid Loss Function

Pretrained with a pairwise sigmoid loss that scores every image-text pair independently, so no softmax normalization across the whole batch is required.
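
The pairwise sigmoid loss from the SigLIP paper can be sketched in plain Python as follows. This is a minimal illustration, not the Big Vision training code: the function names, the list-of-lists embedding format, and the default `temperature` and `bias` values are assumptions chosen for readability, and the embeddings are assumed to be L2-normalized already.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x)), computed as -softplus(-x)."""
    return -(max(-x, 0.0) + math.log1p(math.exp(-abs(x))))


def sigmoid_loss(image_emb, text_emb, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss over a batch of L2-normalized embeddings.

    image_emb[i] and text_emb[i] are assumed to form a matching pair;
    every other combination in the batch is treated as a negative.
    """
    n = len(image_emb)
    total = 0.0
    for i in range(n):
        for j in range(n):
            # Dot product of normalized vectors is their cosine similarity.
            sim = sum(a * b for a, b in zip(image_emb[i], text_emb[j]))
            logit = temperature * sim + bias
            label = 1.0 if i == j else -1.0  # +1 for matches, -1 otherwise
            total -= log_sigmoid(label * logit)
    return total / n


# Toy batch: correctly paired embeddings vs. the same texts in swapped order.
images = [[1.0, 0.0], [0.0, 1.0]]
loss_matched = sigmoid_loss(images, [[1.0, 0.0], [0.0, 1.0]])
loss_swapped = sigmoid_loss(images, [[0.0, 1.0], [1.0, 0.0]])
print(loss_matched < loss_swapped)
```

Each pair contributes its own binary term, which is why the loss needs no normalization over the rest of the batch.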

Zero-shot Classification Capability

Can be directly applied to new image classification tasks without task-specific fine-tuning.
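
A minimal sketch of why no fine-tuning is needed: a new classification task is defined only by its list of candidate text labels. The helper name `build_prompts` and the template `"a photo of a {}"` are assumptions for illustration and are not part of the model's API; plain labels such as "a donut" also work as text inputs.

```python
def build_prompts(class_names, template="a photo of a {}"):
    """Turn bare class names into text prompts for the text encoder."""
    return [template.format(name) for name in class_names]


# Defining a new task only requires a new list of class names.
food_prompts = build_prompts(["donut", "beignet"])
animal_prompts = build_prompts(["cat", "dog"])
print(food_prompts)
print(animal_prompts)
```

The resulting strings would then be tokenized and scored against the image embedding, as in the OpenCLIP usage example.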

Efficient Visual Encoding

Based on the Vision Transformer architecture, capable of efficiently extracting image features.

Model Capabilities

Image Feature Extraction

Zero-shot Image Classification

Multimodal Contrastive Learning

Use Cases

Image Understanding

Food Recognition

Identify food categories in images, such as donuts and beignets.

Can accurately recognize common food categories

Animal Recognition

Identify animal categories in images, such as cats and dogs.

High recognition accuracy for common animals

Content Moderation

Inappropriate Content Detection

Identify potentially inappropriate content in images.

🚀 ViT-SO400M-14-SigLIP-384

A SigLIP (Sigmoid loss for Language-Image Pre-training) model trained on WebLI, enabling zero-shot image classification.

This SigLIP model, trained on the WebLI dataset, offers a powerful solution for zero-shot image classification. It has been converted from the original JAX checkpoints in Big Vision to PyTorch, making it compatible with both OpenCLIP for combined image and text processing and timm for image-only tasks.

🚀 Quick Start

Install the dependencies listed under Installation, then follow the usage examples: OpenCLIP handles image and text together, while timm covers image-only embedding extraction.

✨ Features

Contrastive Image-Text: Capable of learning relationships between images and text.
Zero-Shot Image Classification: Can classify images without prior training on specific classes.

📦 Installation

Ensure you have the necessary libraries installed:

pip install "open-clip-torch>=2.23.0" "timm>=0.9.8" torch pillow

💻 Usage Examples

Basic Usage

With OpenCLIP

import torch
import torch.nn.functional as F
from urllib.request import urlopen
from PIL import Image
from open_clip import create_model_from_pretrained, get_tokenizer # works on open-clip-torch>=2.23.0, timm>=0.9.8

model, preprocess = create_model_from_pretrained('hf-hub:timm/ViT-SO400M-14-SigLIP-384')
tokenizer = get_tokenizer('hf-hub:timm/ViT-SO400M-14-SigLIP-384')

image = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))
image = preprocess(image).unsqueeze(0)

labels_list = ["a dog", "a cat", "a donut", "a beignet"]
text = tokenizer(labels_list, context_length=model.context_length)

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    text_probs = torch.sigmoid(image_features @ text_features.T * model.logit_scale.exp() + model.logit_bias)

zipped_list = list(zip(labels_list, [round(p.item(), 3) for p in text_probs[0]]))
print("Label probabilities: ", zipped_list)
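
Because each label is scored with its own sigmoid, the probabilities are independent and need not sum to 1, so several labels (or none) can pass a cutoff. The sketch below shows one way to rank and filter a list shaped like `zipped_list`. The helper name `rank_labels`, the 0.5 threshold, and the example scores are all assumptions for illustration; the scores are made-up values, not real model output.

```python
def rank_labels(scores, threshold=0.5):
    """Sort (label, probability) pairs and keep labels at or above threshold."""
    ranked = sorted(scores, key=lambda pair: pair[1], reverse=True)
    accepted = [label for label, prob in ranked if prob >= threshold]
    return ranked, accepted


# Hypothetical scores in the same shape as zipped_list (not real output).
example_scores = [("a dog", 0.0), ("a cat", 0.0), ("a donut", 0.02), ("a beignet", 0.6)]
ranked, accepted = rank_labels(example_scores)
print("Top label:", ranked[0][0])
print("Labels above threshold:", accepted)
```

A suitable threshold depends on the task and should be tuned on held-out data.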

With timm (for image embeddings)

from urllib.request import urlopen
from PIL import Image
import timm

image = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model(
    'vit_so400m_patch14_siglip_384',
    pretrained=True,
    num_classes=0,
)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(image).unsqueeze(0))  # output is (batch_size, num_features) shaped tensor
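
One common use of such embeddings is comparing two images by cosine similarity. The sketch below is a dependency-free illustration that works on plain Python lists; the function name `cosine_similarity` is an assumption, and in practice a vectorized routine such as `torch.nn.functional.cosine_similarity` would be the more typical choice.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for image embeddings.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # parallel vectors
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # orthogonal vectors
```

With the timm example, an embedding could be converted with `output[0].detach().tolist()` and compared against the embedding of a second image obtained the same way.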

📚 Documentation

Model Details

Property	Details
Model Type	Contrastive Image-Text, Zero-Shot Image Classification
Original	https://github.com/google-research/big_vision
Training Data	WebLI
Papers	Sigmoid loss for language image pre-training: https://arxiv.org/abs/2303.15343

📄 License

This model is licensed under the Apache-2.0 license.

📚 Citation

@article{zhai2023sigmoid,
  title={Sigmoid loss for language image pre-training},
  author={Zhai, Xiaohua and Mustafa, Basil and Kolesnikov, Alexander and Beyer, Lucas},
  journal={arXiv preprint arXiv:2303.15343},
  year={2023}
}

@misc{big_vision,
  author = {Beyer, Lucas and Zhai, Xiaohua and Kolesnikov, Alexander},
  title = {Big Vision},
  year = {2022},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/google-research/big_vision}}
}
