
SigLIP Base Patch16 512

Developed by Google
SigLIP is a vision-language model pretrained on the WebLI dataset. Its improved sigmoid loss function makes it excel at image classification and image-text retrieval.
Downloads: 237.79k
Release date: 1/8/2024

Model Overview

SigLIP is an enhanced CLIP-style multimodal model whose sigmoid loss operates on individual image-text pairs, eliminating the need for a global normalization over all pairwise similarities. As a result, the model performs well with both large and small batch sizes.
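The pairwise loss described above can be sketched in NumPy. This is an illustrative reconstruction, not the library implementation: the temperature t and bias b are learnable in the real model, and the values used here are placeholders.

```python
import numpy as np

def sigmoid_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Sketch of SigLIP's pairwise sigmoid loss.

    Every image-text pair (i, j) is scored independently: the label is +1
    on the diagonal (matched pairs) and -1 elsewhere, so no softmax over
    the whole batch is needed. t and b are illustrative placeholders for
    the learnable temperature and bias.
    """
    # L2-normalize both sets of embeddings.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = t * img @ txt.T + b                # (N, N) pairwise scores
    n = logits.shape[0]
    labels = 2.0 * np.eye(n) - 1.0              # +1 diagonal, -1 off-diagonal
    # -log sigmoid(label * logit) = log(1 + exp(-label * logit)),
    # summed over all pairs in a row, averaged over images.
    return float(np.mean(np.sum(np.log1p(np.exp(-labels * logits)), axis=1)))
```

Because each pair contributes its own independent term, the loss is computed the same way regardless of batch size, which is why small batches remain viable.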

Model Features

Improved sigmoid loss function: scores image-text pairs independently, with no global similarity normalization, improving performance in small-batch scenarios.
Efficient pretraining: pretrained on the WebLI dataset at 512x512 input resolution.
Zero-shot learning capability: applies directly to image classification and retrieval tasks without fine-tuning.

Model Capabilities

Zero-shot image classification
Image-text retrieval
Multimodal understanding
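As a toy illustration of zero-shot classification with this scoring scheme: given an image embedding and embeddings of candidate label prompts (here hypothetical precomputed vectors standing in for the model's encoder outputs), each label gets an independent sigmoid probability rather than a softmax share.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def zero_shot_classify(image_emb, label_embs, labels, t=10.0, b=-10.0):
    """Rank candidate text labels for one image by per-pair sigmoid score.

    Unlike CLIP's softmax over labels, each score is an independent
    probability, so none, one, or several labels can score high.
    t and b are illustrative placeholders for the learned parameters.
    """
    image = image_emb / np.linalg.norm(image_emb)
    texts = label_embs / np.linalg.norm(label_embs, axis=1, keepdims=True)
    probs = sigmoid(t * texts @ image + b)
    order = np.argsort(-probs)
    return [(labels[i], float(probs[i])) for i in order]
```

In practice the embeddings would come from the model's image and text encoders; the ranking logic stays the same.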

Use Cases

Image understanding
Animal image classification: identifies animal categories in images (e.g., cats, dogs) and accurately distinguishes between them.
Scene understanding: recognizes scenes or activities in images (e.g., playing music, engaging in sports), including activity types in complex scenes.
Content retrieval
Image-text matching: retrieves relevant images from text descriptions, efficiently matching text with image content.
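The retrieval use case follows the same pattern in the other direction: score a text query against a gallery of image embeddings and keep the top-k. The embeddings below are hypothetical stand-ins for encoder outputs, and t and b are placeholder values.

```python
import numpy as np

def retrieve_images(query_emb, image_embs, k=2, t=10.0, b=-10.0):
    """Return indices of the top-k images for one text query,
    scored with the same pairwise sigmoid used at training time."""
    q = query_emb / np.linalg.norm(query_emb)
    imgs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    scores = 1.0 / (1.0 + np.exp(-(t * imgs @ q + b)))
    return np.argsort(-scores)[:k]
```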