Siglip-base-patch16-384 Open-source Multimodal Model - Freely Achieve Zero-shot Image Classification and Image-text Retrieval

Siglip Base Patch16 384

Developed by Google

SigLIP is a multimodal model pre-trained on the WebLI dataset with an improved sigmoid loss function. It is suitable for zero-shot image classification and image-text retrieval tasks.

Image-to-Text

Transformers

Open Source License: Apache-2.0 #Zero-shot image classification #Image-text retrieval #Sigmoid loss function

Downloads 2,570

Release Time: 1/8/2024

Model Overview

SigLIP is a variant of the CLIP multimodal model trained with an improved loss function. Its sigmoid loss operates on individual image-text pairs and does not require normalization over the global set of pairwise similarities. The model is suitable for tasks such as zero-shot image classification and image-text retrieval.

Model Features

Improved Loss Function

Uses a sigmoid loss function that only operates on image-text pairs without requiring normalization through global similarity, enabling the model to perform well in both large and small batch scenarios.

Efficient Training

Training can be completed in just three days on 16 TPU-v4 chips.

High-Resolution Support

Supports image inputs with a resolution of 384x384.
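As a quick sanity check on what the 384x384 resolution implies, the arithmetic below counts the image patches per input. It assumes that "patch16" in the model name denotes non-overlapping 16x16-pixel patches, which is the usual naming convention but is not stated on this page. `patches_per_image` is an illustrative helper, not part of any library.

```python
def patches_per_image(resolution: int, patch_size: int) -> int:
    """Count the non-overlapping square patches that tile a square image."""
    if resolution % patch_size != 0:
        raise ValueError("resolution must be a multiple of patch_size")
    per_side = resolution // patch_size
    return per_side * per_side


# 384 / 16 = 24 patches per side, so 24 * 24 = 576 patches per image.
print(patches_per_image(384, 16))
```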

Model Capabilities

Zero-shot image classification

Image-text retrieval

Use Cases

Image Classification

Animal Recognition

Identify the type of animal in an image, such as cats, dogs, etc.

The model picks the best-matching animal type from a set of candidate text labels, without task-specific training.

Image-Text Retrieval

Image Search

Search for relevant images based on text descriptions.

The model ranks candidate images by how well they match the text description.
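Retrieval of this kind is commonly implemented by embedding the query text and every candidate image, then ranking the images by cosine similarity. The sketch below shows only that ranking step, using tiny made-up vectors in place of real SigLIP embeddings. `rank_images` and the file names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_images(text_embedding, image_embeddings, top_k=3):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [
        (name, cosine_similarity(text_embedding, embedding))
        for name, embedding in image_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Made-up 3-dimensional embeddings standing in for real model output.
gallery = {
    "cats.jpg": [0.9, 0.1, 0.0],
    "plane.jpg": [0.0, 0.2, 0.9],
    "dog.jpg": [0.6, 0.7, 0.1],
}
query = [1.0, 0.2, 0.0]  # stand-in for the embedding of a text query

for name, score in rank_images(query, gallery):
    print(f"{name}: {score:.3f}")
```

In a real system the embeddings would come from the model's text and image encoders, and the image embeddings would typically be computed once and cached.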

🚀 SigLIP (base-sized model)

A pre-trained multimodal model with a better loss function for zero-shot image classification and image-text retrieval.

🚀 Quick Start

This SigLIP checkpoint was pre-trained on WebLI at a resolution of 384x384. It combines the CLIP approach with an improved loss function and is intended for multimodal tasks such as zero-shot image classification and image-text retrieval.

✨ Features

Better Loss Function: The sigmoid loss in SigLIP operates on image-text pairs without the need for a global view of pairwise similarities for normalization, allowing for larger batch sizes and better performance at smaller batch sizes.
Multimodal Capabilities: Suitable for tasks like zero-shot image classification and image-text retrieval.

📦 Installation

To use this model, you need to install the transformers library. You can install it via pip:

pip install transformers

The examples below also use torch for inference, Pillow for image loading, and requests for downloading the sample image. You can install them as follows:

pip install torch Pillow requests
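To confirm the environment is ready before running the examples, a check along the following lines can help. It relies only on the standard library's `importlib.util.find_spec`. Note that the Pillow package is imported under the name `PIL`. The helper name `find_missing` is illustrative.

```python
from importlib.util import find_spec


def find_missing(module_names):
    """Return the top-level module names that cannot be found."""
    return [name for name in module_names if find_spec(name) is None]


# Import names for the packages installed above (Pillow imports as PIL).
required = ["transformers", "torch", "PIL", "requests"]
missing = find_missing(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```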

💻 Usage Examples

Basic Usage

Here is how to use this model to perform zero-shot image classification:

from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch

model = AutoModel.from_pretrained("google/siglip-base-patch16-384")
processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-384")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

texts = ["a photo of 2 cats", "a photo of 2 dogs"]
inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

logits_per_image = outputs.logits_per_image
probs = torch.sigmoid(logits_per_image) # these are the probabilities
print(f"{probs[0][0]:.1%} that image 0 is '{texts[0]}'")
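Because the scores come from a sigmoid rather than a softmax, each text is scored against the image independently and the probabilities need not sum to 1. The following sketch contrasts the two on made-up logit values (not real model output), using only the standard library.

```python
import math


def sigmoid(x):
    """Map a single logit to an independent probability."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    """Map a list of logits to probabilities that sum to 1."""
    shift = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for two candidate texts against one image.
logits = [-1.5, -8.0]

independent = [sigmoid(x) for x in logits]
normalized = softmax(logits)

print("sigmoid:", [f"{p:.4f}" for p in independent])
print("softmax:", [f"{p:.4f}" for p in normalized])
```

With these values, the softmax assigns almost all of the probability to the first text even though its sigmoid score is low. The sigmoid scores can therefore signal that none of the candidate texts fits the image well.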

Advanced Usage

You can also leverage the pipeline API which abstracts away the complexity for the user:

from transformers import pipeline
from PIL import Image
import requests

# load pipe
image_classifier = pipeline(task="zero-shot-image-classification", model="google/siglip-base-patch16-384")

# load image
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

# inference
outputs = image_classifier(image, candidate_labels=["2 cats", "a plane", "a remote"])
outputs = [{"score": round(output["score"], 4), "label": output["label"] } for output in outputs]
print(outputs)
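The pipeline returns a list of dictionaries with `label` and `score` keys, as the rounding step above shows. A small post-processing helper such as the following can keep only the labels above a chosen score. The sample results and the 0.1 threshold are made up for illustration, and `labels_above` is a hypothetical name.

```python
def labels_above(results, threshold):
    """Keep results scoring at least `threshold`, highest score first."""
    kept = [r for r in results if r["score"] >= threshold]
    return sorted(kept, key=lambda r: r["score"], reverse=True)


# Made-up results in the same shape as the pipeline output.
sample = [
    {"score": 0.0001, "label": "a plane"},
    {"score": 0.41, "label": "2 cats"},
    {"score": 0.27, "label": "a remote"},
]

print(labels_above(sample, threshold=0.1))
```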

For more code examples, refer to the documentation.

📚 Documentation

Model Description

SigLIP is a variant of CLIP, a multimodal model, trained with a better loss function. The sigmoid loss operates solely on image-text pairs and does not require a global view of the pairwise similarities for normalization. This allows the batch size to be scaled up further, while also performing better at smaller batch sizes.
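The pairwise loss described above can be sketched in plain Python. The sketch follows the formulation in the cited paper (Zhai et al., 2023): every image-text pair in the batch is treated as an independent binary classification, with label +1 for matching pairs and -1 for all others. The default temperature and bias are illustrative starting values rather than the trained model's parameters, and `pairwise_sigmoid_loss` is a hypothetical name, not the authors' code.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def l2_normalize(vector):
    norm = math.sqrt(sum(c * c for c in vector))
    return [c / norm for c in vector]


def pairwise_sigmoid_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """Average over images of the summed per-pair binary losses.

    image_embs[i] and text_embs[i] are assumed to be a matching pair.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    total = 0.0
    for i, image in enumerate(images):
        for j, text in enumerate(texts):
            similarity = sum(a * b for a, b in zip(image, text))
            logit = temperature * similarity + bias
            label = 1.0 if i == j else -1.0
            total += log_sigmoid(label * logit)
    return -total / len(images)


# Toy batch: aligned pairs should give a lower loss than swapped pairs.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned = pairwise_sigmoid_loss(images, [[1.0, 0.0], [0.0, 1.0]])
swapped = pairwise_sigmoid_loss(images, [[0.0, 1.0], [1.0, 0.0]])
print(f"aligned={aligned:.4f} swapped={swapped:.4f}")
```

Each term depends only on one image-text pair, so no softmax over the whole batch is needed. This is the property that lets the batch be split and scaled freely.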

A TLDR of SigLIP by one of the authors can be found here.

Intended Uses & Limitations

You can use the raw model for tasks like zero-shot image classification and image-text retrieval. Check the model hub for other versions suited to a task that interests you.

Training Procedure

Training Data

SigLIP is pre-trained on the English image-text pairs of the WebLI dataset (Chen et al., 2023).

Preprocessing

Images: Resized/rescaled to the same resolution (384x384) and normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5).
Texts: Tokenized and padded to the same length (64 tokens).
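The two preprocessing steps above amount to simple arithmetic, sketched here on a single pixel value and a short token list. The sketch assumes 8-bit pixel values that are rescaled to [0, 1] before normalization. The padding ID of 0 and the truncation of over-long inputs are placeholders and are not taken from the actual tokenizer. In practice the processor shown in the usage examples handles all of this.

```python
MEAN = 0.5
STD = 0.5
MAX_TEXT_LENGTH = 64
PAD_ID = 0  # placeholder value, not the real tokenizer's padding ID


def normalize_pixel(value):
    """Rescale an 8-bit channel value to [0, 1], then normalize it."""
    rescaled = value / 255.0
    return (rescaled - MEAN) / STD


def pad_tokens(token_ids, length=MAX_TEXT_LENGTH, pad_id=PAD_ID):
    """Truncate or right-pad a token list to exactly `length` entries."""
    clipped = list(token_ids[:length])
    return clipped + [pad_id] * (length - len(clipped))


# With mean 0.5 and std 0.5, the range [0, 255] maps onto [-1, 1].
print(normalize_pixel(0), normalize_pixel(255))  # -1.0 1.0

# Hypothetical token IDs, padded to the fixed length of 64.
print(len(pad_tokens([101, 102, 103])))  # 64
```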

Compute

The model was trained on 16 TPU-v4 chips for three days.

Evaluation Results

An evaluation of SigLIP compared to CLIP is shown below (taken from the paper).

[Figure: evaluation results of SigLIP compared to CLIP, from the paper]

BibTeX entry and citation info

@misc{zhai2023sigmoid,
      title={Sigmoid Loss for Language Image Pre-Training}, 
      author={Xiaohua Zhai and Basil Mustafa and Alexander Kolesnikov and Lucas Beyer},
      year={2023},
      eprint={2303.15343},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

📄 License

This model is licensed under the Apache 2.0 license.
