ViT-L-16-SigLIP-256
A SigLIP model trained on WebLI for zero-shot image classification
This SigLIP (Sigmoid loss for Language-Image Pre-training) model is trained on WebLI. It has been converted from the original JAX checkpoints in Big Vision to PyTorch. These weights can be used in both OpenCLIP (image and text) and timm (image only).
Quick Start
The model can be used through OpenCLIP or timm, as shown in the usage examples below.
Features
- Trained on the WebLI dataset.
- Converted from JAX to PyTorch.
- Usable in both OpenCLIP and timm.
Installation
The installation steps depend on the libraries you want to use. For OpenCLIP, ensure open-clip-torch>=2.23.0 and timm>=0.9.8 are installed. For timm, install the latest version.
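As a quick sanity check before running the examples, the minimum versions listed above can be compared against what is installed. The following is a minimal sketch, not part of either library: the helper names `parse_version`, `meets_minimum`, and `check_environment` are hypothetical, and the parser assumes plain dotted numeric versions (suffixes such as `rc1` or `.dev0` are ignored).

```python
import re
from importlib import metadata

# Minimum versions stated in the installation notes above.
MINIMUMS = {"open-clip-torch": "2.23.0", "timm": "0.9.8"}


def parse_version(version: str) -> tuple:
    """Turn '2.23.0' into (2, 23, 0); non-numeric suffixes are ignored."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Return True if `installed` is at least `minimum`."""
    a, b = parse_version(installed), parse_version(minimum)
    # Pad with zeros so that '2.23' and '2.23.0' compare as equal.
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment() -> dict:
    """Map each required package to True/False, or None if it is not installed."""
    status = {}
    for name, minimum in MINIMUMS.items():
        try:
            status[name] = meets_minimum(metadata.version(name), minimum)
        except metadata.PackageNotFoundError:
            status[name] = None
    return status


if __name__ == "__main__":
    print(check_environment())
```

Comparing parsed integer tuples rather than raw strings matters here: as strings, "2.9.0" would sort after "2.23.0".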
Usage Examples
Basic Usage
With OpenCLIP
import torch
import torch.nn.functional as F
from urllib.request import urlopen
from PIL import Image
from open_clip import create_model_from_pretrained, get_tokenizer

# Load the model, its preprocessing transform, and the matching tokenizer.
model, preprocess = create_model_from_pretrained('hf-hub:timm/ViT-L-16-SigLIP-256')
tokenizer = get_tokenizer('hf-hub:timm/ViT-L-16-SigLIP-256')

image = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))
image = preprocess(image).unsqueeze(0)  # add a batch dimension

labels_list = ["a dog", "a cat", "a donut", "a beignet"]
text = tokenizer(labels_list, context_length=model.context_length)

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # L2-normalize so the dot product below is a cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Each image-text pair is scored independently with a sigmoid.
    text_probs = torch.sigmoid(image_features @ text_features.T * model.logit_scale.exp() + model.logit_bias)

zipped_list = list(zip(labels_list, [round(p.item(), 3) for p in text_probs[0]]))
print("Label probabilities: ", zipped_list)
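The scoring step in the example above applies a sigmoid to each image-text logit independently, so, unlike a softmax over labels, the resulting probabilities need not sum to 1. A minimal pure-Python sketch of that arithmetic follows. The similarity values, `scale`, and `bias` are made-up illustrative numbers, not outputs of this model, and `sigmoid_scores` and `softmax_scores` are hypothetical helpers.

```python
import math


def sigmoid_scores(similarities, scale, bias):
    """Score each label independently: sigmoid(scale * similarity + bias)."""
    return [
        1.0 / (1.0 + math.exp(-(scale * s + bias)))
        for s in similarities
    ]


def softmax_scores(similarities, scale):
    """Softmax over labels, shown only for contrast with the sigmoid form."""
    logits = [scale * s for s in similarities]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical cosine similarities between one image and four label prompts.
labels = ["a dog", "a cat", "a donut", "a beignet"]
similarities = [0.02, 0.01, 0.10, 0.15]

# Illustrative values only; in the example above they come from
# model.logit_scale.exp() and model.logit_bias.
scale, bias = 100.0, -12.0

sigmoid_probs = sigmoid_scores(similarities, scale, bias)
softmax_probs = softmax_scores(similarities, scale)

for label, p_sig, p_soft in zip(labels, sigmoid_probs, softmax_probs):
    print(f"{label:>10}: sigmoid={p_sig:.3f}  softmax={p_soft:.3f}")
```

Because the scores are independent, several labels can receive a high probability at once, or none at all, which is worth keeping in mind when thresholding the output.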
With timm (for image embeddings)
from urllib.request import urlopen
from PIL import Image
import timm

image = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model(
    'vit_large_patch16_siglip_256',
    pretrained=True,
    num_classes=0,  # remove the classifier head so the model returns pooled features
)
model = model.eval()

# Build the preprocessing transform from the model's own data configuration.
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(image).unsqueeze(0))  # image embedding for a batch of one
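With `num_classes=0`, the model returns pooled image features rather than class logits. One common use of such embeddings is comparing images by cosine similarity. The sketch below shows that comparison on short made-up vectors in plain Python; `cosine_similarity` is a hypothetical helper, and real embeddings from the model would be much longer.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up stand-ins for image embeddings (e.g. output[0].tolist()).
query = [0.2, 0.1, 0.7]
gallery = {
    "image_a": [0.4, 0.2, 1.4],    # same direction as the query
    "image_b": [0.7, -0.1, -0.2],  # points away from the query
}

# Rank gallery entries from most to least similar to the query.
ranked = sorted(
    gallery,
    key=lambda name: cosine_similarity(query, gallery[name]),
    reverse=True,
)
print(ranked)
```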
Documentation
Model Details
License
This model is released under the Apache-2.0 license.
Citation
@article{zhai2023sigmoid,
  title={Sigmoid loss for language image pre-training},
  author={Zhai, Xiaohua and Mustafa, Basil and Kolesnikov, Alexander and Beyer, Lucas},
  journal={arXiv preprint arXiv:2303.15343},
  year={2023}
}
@misc{big_vision,
  author = {Beyer, Lucas and Zhai, Xiaohua and Kolesnikov, Alexander},
  title = {Big Vision},
  year = {2022},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/google-research/big_vision}}
}