SigLIP (base-sized model, multilingual)
A pre-trained multimodal model with a better loss function for tasks like zero-shot image classification and image-text retrieval.
🚀 Quick Start
SigLIP is a model pre-trained on WebLI at a resolution of 256x256. It was introduced in the paper Sigmoid Loss for Language Image Pre-Training by Zhai et al. and first released in this repository.
Disclaimer: The team releasing SigLIP did not write a model card for this model, so this model card has been written by the Hugging Face team.
✨ Features
SigLIP is an enhanced version of CLIP, a multimodal model, with a better loss function. The sigmoid loss operates only on image-text pairs and does not require a global view of the pairwise similarities for normalization. This allows further scaling up of the batch size, while also performing better at smaller batch sizes. A TLDR of SigLIP by one of the authors can be found here.
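To make this concrete, here is a minimal PyTorch sketch of the pairwise sigmoid loss described in the paper; the function and variable names (siglip_loss, img_emb, txt_emb, t, b) are our own, and t denotes the already-exponentiated temperature:

import torch
import torch.nn.functional as F

def siglip_loss(img_emb, txt_emb, t, b):
    # img_emb, txt_emb: L2-normalized image/text embeddings of shape (n, d)
    # t: learnable temperature (scalar), b: learnable bias (scalar)
    logits = img_emb @ txt_emb.t() * t + b
    # +1 on the diagonal (matching pairs), -1 everywhere else
    labels = 2 * torch.eye(logits.size(0), device=logits.device) - 1
    # each image-text pair is an independent binary problem, so no
    # softmax normalization over the whole batch is required
    return -F.logsigmoid(labels * logits).sum() / logits.size(0)

# toy example with random, normalized embeddings
img = F.normalize(torch.randn(4, 16), dim=-1)
txt = F.normalize(torch.randn(4, 16), dim=-1)
print(siglip_loss(img, txt, t=torch.tensor(10.0), b=torch.tensor(-10.0)))

The paper initializes the temperature to 10 and the bias to -10, which keeps the many negative pairs in a batch from dominating the loss early in training.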
📦 Installation
The original model card does not list installation steps. The usage examples below rely on the Hugging Face Transformers library, along with PyTorch, Pillow and requests:

pip install transformers torch pillow requests
💻 Usage Examples
Basic Usage
Here is how to use this model to perform zero-shot image classification:
from PIL import Image
import requests
import torch
from transformers import AutoProcessor, AutoModel

# load the model and its processor
model = AutoModel.from_pretrained("google/siglip-base-patch16-256-multilingual")
processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-256-multilingual")

# download an example image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# SigLIP expects texts padded to the model's maximum length ("max_length")
texts = ["a photo of 2 cats", "a photo of 2 dogs"]
inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# sigmoid (not softmax): each image-text pair gets an independent probability
logits_per_image = outputs.logits_per_image
probs = torch.sigmoid(logits_per_image)
print(f"{probs[0][0]:.1%} that image 0 is '{texts[0]}'")
Advanced Usage
Alternatively, one can leverage the pipeline API, which abstracts away the complexity for the user:
from transformers import pipeline
from PIL import Image
import requests

# load the zero-shot image classification pipeline
image_classifier = pipeline(task="zero-shot-image-classification", model="google/siglip-base-patch16-256-multilingual")

# download an example image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# classify against free-form candidate labels
outputs = image_classifier(image, candidate_labels=["2 cats", "a plane", "a remote"])
outputs = [{"score": round(output["score"], 4), "label": output["label"]} for output in outputs]
print(outputs)
For more code examples, we refer to the documentation.
📚 Documentation
Intended uses & limitations
You can use the raw model for tasks like zero-shot image classification and image-text retrieval. See the model hub to look for other versions for a task that interests you.
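For image-text retrieval, the image and text embeddings can be compared directly. Below is a minimal sketch using the get_image_features and get_text_features methods of the Transformers model, reusing the model, processor and image from the usage example above; the cosine-similarity ranking is illustrative, not prescribed by the original card:

import torch
import torch.nn.functional as F

# embed the query text and the candidate image(s) separately
text_inputs = processor(text=["a photo of 2 cats"], padding="max_length", return_tensors="pt")
image_inputs = processor(images=[image], return_tensors="pt")
with torch.no_grad():
    text_emb = model.get_text_features(**text_inputs)
    image_emb = model.get_image_features(**image_inputs)

# L2-normalize, then rank candidates by cosine similarity
text_emb = F.normalize(text_emb, dim=-1)
image_emb = F.normalize(image_emb, dim=-1)
similarity = text_emb @ image_emb.t()
best = similarity.argmax(dim=-1)  # index of the best-matching image per query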
Training procedure
Training data
SigLIP is pre-trained on the WebLI dataset, without any language filtering (Chen et al., 2023).
Preprocessing
Images are resized/rescaled to the same resolution (256x256) and normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5). Texts are tokenized and padded to the same length (64 tokens).
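These settings can be checked directly by inspecting what the processor produces (a quick sanity check, not part of the original card):

from PIL import Image
import requests
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-256-multilingual")
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(text=["a photo of 2 cats"], images=image, padding="max_length", return_tensors="pt")

print(inputs["pixel_values"].shape)          # torch.Size([1, 3, 256, 256]): 256x256 RGB input
print(inputs["input_ids"].shape)             # torch.Size([1, 64]): texts padded to 64 tokens
print(processor.image_processor.image_mean)  # [0.5, 0.5, 0.5]
print(processor.image_processor.image_std)   # [0.5, 0.5, 0.5]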
Compute
The model was trained on 16 TPU-v4 chips for three days.
Evaluation results
For the evaluation of SigLIP compared to CLIP, we refer to the paper; the corresponding results figure is not reproduced in this card.
BibTeX entry and citation info
@misc{zhai2023sigmoid,
      title={Sigmoid Loss for Language Image Pre-Training},
      author={Xiaohua Zhai and Basil Mustafa and Alexander Kolesnikov and Lucas Beyer},
      year={2023},
      eprint={2303.15343},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
📄 License
This model is licensed under the Apache-2.0 license.