SigLIP-Large-Patch16-384 Open-Source Multimodal Model - Free for Zero-Shot Image Classification and Image-Text Retrieval

SigLIP Large Patch16 384

Developed by Google

SigLIP is a multimodal model pre-trained on the WebLI dataset with a sigmoid loss function in place of CLIP's loss. It is suited to zero-shot image classification and image-text retrieval.

Image-to-Text

Transformers

Open Source License: Apache-2.0

Tags: Zero-shot image classification, Sigmoid loss function, Image-text retrieval

Downloads 245.21k

Release Time: 1/8/2024

Model Overview

SigLIP is CLIP, a multimodal model, trained with a better loss function. The sigmoid loss operates on each image-text pair independently and does not need a global view of all pairwise similarities for normalization. This lets the batch size be scaled up further while also giving better results at small batch sizes.

Model Features

Improved Sigmoid Loss Function

Each image-text pair is scored independently, with no normalization over all pairwise similarities in the batch, so training works well both at very large batch sizes and at small ones.

High Performance

Targets zero-shot image classification and image-text retrieval. The SigLIP paper reports results that improve on CLIP for these tasks.

Multimodal Support

Supports dual-modal processing of images and text, suitable for various vision-language tasks.

Model Capabilities

Zero-shot image classification

Image-text retrieval

Multimodal processing

Use Cases

Image Classification

Zero-shot Image Classification

Classifies images without task-specific training, using custom labels supplied at inference time.

According to the evaluation in the SigLIP paper, the model compares favorably with CLIP.

Image-Text Retrieval

Image Search

Retrieves relevant images based on text descriptions.

Images and texts are embedded separately, so image embeddings can be computed once and reused across queries, which suits large image libraries.

🚀 SigLIP (large-sized model)

SigLIP is a multimodal model pre-trained on WebLI at 384x384 resolution. It is suited to tasks such as zero-shot image classification and image-text retrieval.

🚀 Quick Start

SigLIP is a pre-trained model introduced in the paper Sigmoid Loss for Language Image Pre-Training by Zhai et al. It is based on CLIP, with a better loss function.

✨ Features

Better Loss Function: The sigmoid loss in SigLIP operates only on image-text pairs, enabling further scaling of batch size and better performance at smaller batch sizes.
Multimodal Capability: Suitable for tasks such as zero-shot image classification and image-text retrieval.
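To make the "operates only on image-text pairs" point concrete, here is a minimal NumPy sketch of a pairwise sigmoid loss in the spirit of the paper. It is an illustration under stated assumptions, not the training code: the function names are hypothetical, and the `temperature` and `bias` defaults are illustrative fixed values (in the paper both are learned parameters).

```python
import numpy as np

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)) = -log(1 + exp(-x)).
    return -np.logaddexp(0.0, -x)

def pairwise_sigmoid_loss(img_emb, txt_emb, temperature=10.0, bias=-10.0):
    """Sketch of a pairwise sigmoid loss over a batch of n image-text pairs.

    img_emb, txt_emb: arrays of shape (n, d); row i of each forms a matching pair.
    temperature, bias: illustrative fixed values (assumption; learned in practice).
    """
    # L2-normalize so the dot product is a cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    # One logit per (image, text) combination in the batch.
    logits = temperature * (img @ txt.T) + bias

    # +1 on the diagonal (matching pairs), -1 everywhere else.
    n = logits.shape[0]
    labels = 2.0 * np.eye(n) - 1.0

    # Every entry is an independent binary classification term;
    # no softmax over rows or columns is needed.
    return -np.sum(log_sigmoid(labels * logits)) / n

# Toy check: aligned pairs should score a lower loss than shuffled pairs.
emb = np.eye(3)
print(pairwise_sigmoid_loss(emb, emb))
print(pairwise_sigmoid_loss(emb, emb[[1, 2, 0]]))
```

Because each term depends on a single image-text combination, the sum can be accumulated in chunks, which is one way to read the batch-size scaling claim above.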

📦 Installation

The original README gives no installation steps. The examples below import transformers, torch, PIL (Pillow), and requests, so those packages must be available.

💻 Usage Examples

Basic Usage

Here is how to use this model to perform zero-shot image classification:

from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch

model = AutoModel.from_pretrained("google/siglip-large-patch16-384")
processor = AutoProcessor.from_pretrained("google/siglip-large-patch16-384")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

texts = ["a photo of 2 cats", "a photo of 2 dogs"]
inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

logits_per_image = outputs.logits_per_image
probs = torch.sigmoid(logits_per_image) # these are the probabilities
print(f"{probs[0][0]:.1%} that image 0 is '{texts[0]}'")
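One consequence of applying a sigmoid to the logits is that each label receives its own independent probability, so the values need not sum to 1 as they would under a softmax. The sketch below illustrates this with plain Python; the logit values are made up for the example and are not output from the model.

```python
import math

def sigmoid(x):
    # Independent probability for a single logit.
    return 1.0 / (1.0 + math.exp(-x))

def softmax(xs):
    # Probabilities normalized across all logits (shown for contrast only).
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for two candidate texts (not real model output).
logits = [-1.2, -8.5]

sigmoid_probs = [sigmoid(x) for x in logits]
softmax_probs = softmax(logits)

print("sigmoid:", sigmoid_probs, "sum =", sum(sigmoid_probs))
print("softmax:", softmax_probs, "sum =", sum(softmax_probs))
```

In practice this means every candidate text can score low for an image that matches none of them, which a softmax over the same logits would hide.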

Advanced Usage

Alternatively, one can leverage the pipeline API which abstracts away the complexity for the user:

from transformers import pipeline
from PIL import Image
import requests

# load pipe
image_classifier = pipeline(task="zero-shot-image-classification", model="google/siglip-large-patch16-384")

# load image
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

# inference
outputs = image_classifier(image, candidate_labels=["2 cats", "a plane", "a remote"])
outputs = [{"score": round(output["score"], 4), "label": output["label"] } for output in outputs]
print(outputs)

For more code examples, we refer to the documentation.

📚 Documentation

Intended uses & limitations

You can use the raw model for tasks like zero-shot image classification and image-text retrieval. See the model hub to look for other versions on a task that interests you.
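The usage examples cover classification only. For retrieval, one common approach is to embed the text query and the images separately and rank images by cosine similarity. The sketch below shows only the ranking step with NumPy; `rank_images` is a hypothetical helper, and the toy vectors stand in for embeddings that would in practice come from the model.

```python
import numpy as np

def rank_images(text_emb, image_embs, top_k=3):
    """Return (image_index, cosine_similarity) for the top_k best matches.

    text_emb: shape (d,), embedding of the query text.
    image_embs: shape (n, d), one embedding per image in the library.
    """
    # Normalize so that the dot product equals cosine similarity.
    text = text_emb / np.linalg.norm(text_emb)
    images = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)

    scores = images @ text
    # Sort by descending score and keep the first top_k indices.
    order = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in order]

# Toy 2-dimensional embeddings (placeholders, not real model output).
image_embs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
text_emb = np.array([0.9, 0.1])
print(rank_images(text_emb, image_embs, top_k=2))
```

Image embeddings can be computed once and stored, so only the query text needs to be embedded at search time.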

Training procedure

Training data

SigLIP is pre-trained on the English image-text pairs of the WebLI dataset (Chen et al., 2023).

Preprocessing

Images are resized/rescaled to the same resolution (384x384) and normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5). Texts are tokenized and padded to the same length (64 tokens).
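As a rough illustration of the arithmetic described above, the following NumPy sketch normalizes an image that is already 384x384 and pads a list of token ids to 64. It is not the library's processor: it assumes "rescaled" means dividing 8-bit pixel values by 255, it skips the resize step, and `PAD_ID` plus the truncation of overlong inputs are placeholder assumptions.

```python
import numpy as np

MEAN = np.array([0.5, 0.5, 0.5], dtype=np.float32)  # per-channel mean from the text
STD = np.array([0.5, 0.5, 0.5], dtype=np.float32)   # per-channel std from the text
MAX_LEN = 64                                         # padded text length from the text
PAD_ID = 1                                           # hypothetical pad token id (assumption)

def normalize_image(pixels_uint8):
    """Map an (H, W, 3) uint8 image to float32 values in [-1, 1]."""
    # Assumption: rescaling means mapping [0, 255] to [0, 1].
    scaled = pixels_uint8.astype(np.float32) / 255.0
    # With mean 0.5 and std 0.5, [0, 1] is mapped to [-1, 1].
    return (scaled - MEAN) / STD

def pad_token_ids(token_ids, max_len=MAX_LEN, pad_id=PAD_ID):
    """Pad (or truncate, as an assumption) a list of token ids to max_len."""
    ids = list(token_ids)[:max_len]
    return ids + [pad_id] * (max_len - len(ids))

# A black 384x384 image and a short, made-up token sequence.
image = np.zeros((384, 384, 3), dtype=np.uint8)
print(normalize_image(image).min(), normalize_image(image).shape)
print(len(pad_token_ids([101, 202, 303])))
```

The fixed length of 64 is why the basic usage example passes padding="max_length" to the processor.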

Compute

The model was trained on 16 TPU-v4 chips for three days.

Evaluation results

The paper reports an evaluation of SigLIP compared to CLIP. [Figure from the paper; image not available in this copy.]

BibTeX entry and citation info

@misc{zhai2023sigmoid,
      title={Sigmoid Loss for Language Image Pre-Training},
      author={Xiaohua Zhai and Basil Mustafa and Alexander Kolesnikov and Lucas Beyer},
      year={2023},
      eprint={2303.15343},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

📄 License

This model is released under the Apache 2.0 license.
