
SigLIP So400m Patch16 256 I18n

Developed by Google
A multimodal model built on the SoViT backbone and trained with a sigmoid loss function, supporting zero-shot image classification and image-text retrieval.
Downloads 230
Release Date: 10/21/2024

Model Overview

SigLIP is a vision-language pretraining model that improves on CLIP by replacing the softmax contrastive loss with a pairwise sigmoid loss. This removes the need to normalize similarities across the whole batch, improves training efficiency, scales to larger batch sizes, and also performs better in small-batch scenarios.

Model Features

Sigmoid Loss Function
Operates on each image-text pair independently, eliminating the need for global similarity normalization across the batch and enabling training with larger batches.
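The pairwise loss described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the training code: the temperature `t` and bias `b` are learnable in the real model but fixed here to illustrative values, and the embeddings are random stand-ins for encoder outputs.

```python
import numpy as np

def siglip_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Pairwise sigmoid loss over every image-text pair in the batch.

    img_emb, txt_emb: (N, D) L2-normalized embeddings.
    t, b: temperature and bias (learnable in the real model; fixed here).
    """
    logits = t * img_emb @ txt_emb.T + b      # (N, N) logits for all pairs
    n = logits.shape[0]
    labels = 2.0 * np.eye(n) - 1.0            # +1 for matched pairs, -1 otherwise
    # log sigmoid(z) = -log(1 + exp(-z)), computed stably with logaddexp
    log_sig = -np.logaddexp(0.0, -labels * logits)
    # Plain mean over all pairs: no softmax normalization over the batch
    return -log_sig.mean()

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
img = l2norm(rng.normal(size=(4, 8)))   # stand-in image embeddings
txt = l2norm(rng.normal(size=(4, 8)))   # stand-in text embeddings
print(siglip_loss(img, txt))
```

Because each pair contributes its own independent binary term, the loss decomposes over pairs and no cross-device gather of the full similarity matrix is required, which is what makes very large batches practical.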
Computationally Optimal Architecture
Uses the shape-optimized SoViT-400m backbone, whose width and depth were chosen for compute-optimal scaling.
Multilingual Support
Pretrained on a multilingual (i18n) corpus at 256×256 resolution, supporting international applications.

Model Capabilities

Zero-shot Image Classification
Image-Text Retrieval
Multimodal Understanding
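The capabilities above all reduce to scoring an image embedding against text embeddings. A minimal sketch of zero-shot classification under this scoring scheme follows; the embedding values, `t`, and `b` are hypothetical stand-ins for real encoder outputs and learned parameters. The key difference from CLIP is that each candidate label gets an independent sigmoid probability rather than a share of a softmax.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def zero_shot_scores(image_emb, label_embs, t=10.0, b=-10.0):
    """Score one image against K candidate text labels.

    Unlike CLIP's softmax, each label gets an independent probability:
    scores need not sum to 1, and several labels can be confident at once.
    image_emb: (D,), label_embs: (K, D), all L2-normalized.
    """
    logits = t * label_embs @ image_emb + b
    return sigmoid(logits)

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Toy demo with hand-built vectors (illustrative, not real model outputs)
image = l2norm(np.array([1.0, 0.2, 0.0]))
labels = l2norm(np.array([
    [1.0, 0.1, 0.0],   # e.g. "a photo of a cat" -- close to the image
    [0.0, 0.0, 1.0],   # e.g. "a photo of a dog" -- far from the image
]))
probs = zero_shot_scores(image, labels)
print(probs)  # first label scores much higher than the second
```

The same dot-product scoring, transposed, drives image-text retrieval: rank candidate images by their sigmoid score against a query text embedding.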

Use Cases

Content Classification
Animal Recognition
Identify animals such as cats and dogs in images
Example prompts accurately distinguish cat images from dog images
Media Analysis
Scene Understanding
Identify activity types in images (e.g., playing music, sports)