ViT-SO400M-14-SigLIP-384
A SigLIP (Sigmoid loss for Language-Image Pre-training) model trained on WebLI, enabling zero-shot image classification.
The weights were converted from the original JAX checkpoints in Big Vision to PyTorch. They can be used in OpenCLIP (image and text) and in timm (image only).
Quick Start
To get started, install the packages listed under Installation, then load the model with OpenCLIP (image + text) or timm (image only) as shown in Usage Examples.
Features
- Contrastive Image-Text: Embeds images and text in a shared space so that matching pairs score higher than mismatched ones.
- Zero-Shot Image Classification: Scores images against arbitrary text labels without any training on those specific classes.
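As a rough illustration of what the "sigmoid loss" in the model name refers to, the sketch below computes a pairwise sigmoid loss over a small image-text similarity matrix in plain Python. It is a minimal sketch of the formulation described in the SigLIP paper listed under Citation, not the Big Vision training code; the function names, the `scale` and `bias` values, and the similarity numbers are illustrative assumptions rather than values from this checkpoint.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pairwise_sigmoid_loss(similarities, scale, bias):
    """Pairwise sigmoid loss for a square similarity matrix.

    similarities[i][j] is the cosine similarity between image i and text j.
    Matching pairs sit on the diagonal and get label +1; all others get -1.
    """
    n = len(similarities)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0
            logit = scale * similarities[i][j] + bias
            total += log_sigmoid(label * logit)
    return -total / n


# Illustrative values only (assumptions, not taken from the checkpoint).
aligned = [[0.9, 0.1], [0.1, 0.9]]    # matching pairs are the most similar
shuffled = [[0.1, 0.9], [0.9, 0.1]]   # matching pairs are the least similar

aligned_loss = pairwise_sigmoid_loss(aligned, scale=10.0, bias=-10.0)
shuffled_loss = pairwise_sigmoid_loss(shuffled, scale=10.0, bias=-10.0)
print(f"aligned: {aligned_loss:.4f}  shuffled: {shuffled_loss:.4f}")
```

Each image-text pair contributes its own binary term, so the loss needs no normalization across the batch; the aligned matrix should yield the smaller loss.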
Installation
Ensure you have the necessary libraries installed:
pip install "open-clip-torch>=2.23.0" "timm>=0.9.8" torch pillow
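To confirm that an existing environment meets these minimums, a small check along the following lines may help. It is a sketch using only the standard library; `version_tuple` and `check_minimums` are hypothetical helpers that compare only the leading numeric release components and ignore pre-release suffixes.

```python
import itertools
from importlib import metadata

# Minimum versions taken from the pip command above.
MINIMUMS = {"open-clip-torch": "2.23.0", "timm": "0.9.8"}


def version_tuple(text):
    """Turn '2.23.0' into (2, 23, 0), keeping only the leading digits of each part."""
    parts = []
    for piece in text.split("."):
        digits = "".join(itertools.takewhile(str.isdigit, piece))
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def check_minimums(minimums=MINIMUMS):
    """Map each package name to 'ok', 'too old', or 'not installed'."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
            continue
        meets = version_tuple(installed) >= version_tuple(minimum)
        report[name] = "ok" if meets else "too old"
    return report


if __name__ == "__main__":
    for package, status in check_minimums().items():
        print(f"{package}: {status}")
```

Comparing integer tuples rather than raw strings matters here: as strings, "2.9.0" would sort after "2.23.0".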
Usage Examples
Basic Usage
With OpenCLIP
import torch
import torch.nn.functional as F
from urllib.request import urlopen
from PIL import Image
from open_clip import create_model_from_pretrained, get_tokenizer

# Load the model, its preprocessing transform, and the matching tokenizer from the Hugging Face Hub.
model, preprocess = create_model_from_pretrained('hf-hub:timm/ViT-SO400M-14-SigLIP-384')
tokenizer = get_tokenizer('hf-hub:timm/ViT-SO400M-14-SigLIP-384')

image = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))
image = preprocess(image).unsqueeze(0)  # add a batch dimension

labels_list = ["a dog", "a cat", "a donut", "a beignet"]
text = tokenizer(labels_list, context_length=model.context_length)

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Sigmoid (not softmax) over the scaled, biased image-text similarities.
    text_probs = torch.sigmoid(image_features @ text_features.T * model.logit_scale.exp() + model.logit_bias)

zipped_list = list(zip(labels_list, [round(p.item(), 3) for p in text_probs[0]]))
print("Label probabilities: ", zipped_list)
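Because the final step applies a sigmoid rather than a softmax, each label receives an independent probability and the values need not sum to 1. The sketch below illustrates the difference in plain Python so that it runs without downloading the model; the logits are made-up placeholder numbers, not output from this checkpoint, and `labels_above` is a hypothetical helper.

```python
import math


def sigmoid(x):
    """Logistic function for a single logit."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(values):
    """Softmax over a list of logits, shifted by the max for stability."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def labels_above(labels, probs, threshold=0.5):
    """Return every label whose independent probability exceeds the threshold."""
    return [label for label, p in zip(labels, probs) if p > threshold]


labels = ["a dog", "a cat", "a donut", "a beignet"]
logits = [-12.0, -14.0, -3.0, 1.5]  # placeholder logits, NOT real model output

sigmoid_probs = [sigmoid(v) for v in logits]
softmax_probs = softmax(logits)

print("sigmoid sum:", round(sum(sigmoid_probs), 3))  # generally not 1
print("softmax sum:", round(sum(softmax_probs), 3))  # 1 by construction
print("accepted labels:", labels_above(labels, sigmoid_probs))
```

With sigmoid scores, several labels can pass the threshold at once, or none at all, which a softmax over the same logits cannot express.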
With timm (for image embeddings)
from urllib.request import urlopen
from PIL import Image
import timm

image = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model(
    'vit_so400m_patch14_siglip_384',
    pretrained=True,
    num_classes=0,  # remove the classifier head so the model returns pooled embeddings
)
model = model.eval()

# Build the model-specific preprocessing transform (resize, normalization).
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(image).unsqueeze(0))  # tensor of shape (batch_size, num_features)
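One common use of such embeddings is comparing images by cosine similarity. The following is a minimal sketch, assuming the embeddings are stacked into a `(batch_size, num_features)` tensor like `output`; `cosine_similarity_matrix` is a hypothetical helper, and the random tensor merely stands in for real model outputs.

```python
import torch
import torch.nn.functional as F


def cosine_similarity_matrix(embeddings: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine similarities for a (batch_size, num_features) tensor."""
    normalized = F.normalize(embeddings, dim=-1)  # unit-length rows
    return normalized @ normalized.T


# Stand-in for embeddings of three images (assumption: not real model output).
torch.manual_seed(0)
fake_embeddings = torch.randn(3, 8)

similarity = cosine_similarity_matrix(fake_embeddings)
print(similarity)
```

Entry `[i, j]` is the similarity between images `i` and `j`; the diagonal is 1 because every embedding matches itself.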
Documentation
Model Details
License
This model is licensed under the Apache-2.0 license.
Citation
@article{zhai2023sigmoid,
title={Sigmoid loss for language image pre-training},
author={Zhai, Xiaohua and Mustafa, Basil and Kolesnikov, Alexander and Beyer, Lucas},
journal={arXiv preprint arXiv:2303.15343},
year={2023}
}
@misc{big_vision,
author = {Beyer, Lucas and Zhai, Xiaohua and Kolesnikov, Alexander},
title = {Big Vision},
year = {2022},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/google-research/big_vision}}
}