🚀 Model Card: CLIP
The CLIP model was developed by OpenAI researchers to study what contributes to robustness in computer vision tasks and to test the ability of models to generalize to arbitrary image classification tasks in a zero-shot manner.
🚀 Quick Start
The CLIP model is not intended for general deployment; researchers need to study its capabilities in their specific context before deploying it. You can use it with the transformers library as follows:
```python
from PIL import Image
import requests
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities
```
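Here `probs` has shape `(1, 2)`: one probability per candidate caption for the single input image, so the highest value is CLIP's zero-shot prediction.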
✨ Features
- The CLIP model helps researchers understand what contributes to robustness in computer vision tasks.
- It can test the zero-shot generalization ability of models on arbitrary image classification tasks (see the prompt-templating sketch after this list).
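As a concrete illustration of the second point, the sketch below turns an arbitrary, user-chosen label list into text prompts and picks the highest-scoring label. The label list and the `"a photo of a {label}"` template are illustrative assumptions, not part of the original card.

```python
from PIL import Image
import requests
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

# Arbitrary, user-chosen labels (illustrative): any list of strings works.
labels = ["cat", "dog", "remote control", "sofa"]
prompts = [f"a photo of a {label}" for label in labels]

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=1)

# Report the most likely label for the image.
print(labels[probs.argmax(dim=1).item()])
```

Because the labels are just strings, the same code applies to any classification task without retraining.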
📚 Documentation
Model Details
The CLIP model was developed by OpenAI researchers in January 2021.
| Property | Details |
|----------|---------|
| Model Type | The base model uses a ViT-B/16 Transformer architecture as an image encoder and a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss (see the loss sketch below the table). The original implementation had two variants: one using a ResNet image encoder and the other using a Vision Transformer. This repository has the variant with the Vision Transformer. |
| Model Date | January 2021 |
| Documents | Blog Post, CLIP Paper |
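To make the contrastive objective in the table concrete, here is a minimal PyTorch sketch of a CLIP-style loss over a batch of paired image and text embeddings. The function name, tensor shapes, and the explicit `logit_scale` argument are illustrative assumptions; the symmetric cross-entropy form follows the CLIP paper.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, logit_scale):
    # image_embeds, text_embeds: (batch, dim) outputs of the two encoders (illustrative shapes).
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # Pairwise cosine similarities, scaled by a learnable temperature (logit_scale).
    logits_per_image = logit_scale * image_embeds @ text_embeds.t()
    logits_per_text = logits_per_image.t()

    # Matching pairs lie on the diagonal; symmetric cross-entropy over both directions.
    targets = torch.arange(image_embeds.size(0), device=image_embeds.device)
    return (F.cross_entropy(logits_per_image, targets) +
            F.cross_entropy(logits_per_text, targets)) / 2
```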
Use with Transformers
You can use the CLIP model with the transformers library as shown in the code example above.
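Beyond the paired forward pass shown above, `CLIPModel` also exposes `get_image_features` and `get_text_features` for extracting embeddings separately. The cosine-similarity comparison below is a minimal sketch of how those embeddings can be used.

```python
import torch
from PIL import Image
import requests
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

text_inputs = processor(text=["a photo of a cat", "a photo of a dog"], return_tensors="pt", padding=True)
image_inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    text_embeds = model.get_text_features(**text_inputs)
    image_embeds = model.get_image_features(**image_inputs)

# Cosine similarity between the image and each caption.
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
print(image_embeds @ text_embeds.t())
```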
Intended Use
The model is intended as a research output for research communities. It aims to help researchers better understand and explore zero-shot, arbitrary image classification, and it can be used for interdisciplinary studies of its potential impacts. The primary intended users are AI researchers.
Out-of-Scope Use Cases
- Any deployed use case of the model (commercial or not) is currently out of scope. Non-deployed use cases, such as image search in a constrained environment, are also not recommended without thorough in-domain testing.
- Use cases in the domains of surveillance and facial recognition are always out of scope, given the current lack of testing norms and checks to ensure fair use.
- Since the model has not been trained or evaluated on non-English languages, its use should be limited to English-language use cases.
Data
The model was trained on publicly available image-caption data, gathered by crawling a number of websites and combining pre-existing datasets such as YFCC100M. As a result, the data is more representative of people and societies most connected to the internet, which skews toward developed nations and toward younger, male users.
Data Mission Statement
The goal of building this dataset was to test robustness and generalizability in computer vision tasks. The data was gathered in a mostly non-interventionist manner, and crawling was limited to websites with policies against excessively violent and adult images. The dataset is not intended for commercial or deployed models and will not be released.
Performance
The performance of CLIP was evaluated on a wide range of benchmarks across various computer vision datasets (a zero-shot evaluation sketch follows the list), including:
- Food101
- CIFAR10
- CIFAR100
- Birdsnap
- SUN397
- Stanford Cars
- FGVC Aircraft
- VOC2007
- DTD
- Oxford-IIIT Pet dataset
- Caltech101
- Flowers102
- MNIST
- SVHN
- IIIT5K
- Hateful Memes
- SST-2
- UCF101
- Kinetics700
- Country211
- CLEVR Counting
- KITTI Distance
- STL-10
- RareAct
- Flickr30
- MSCOCO
- ImageNet
- ImageNet-A
- ImageNet-R
- ImageNet Sketch
- ObjectNet (ImageNet Overlap)
- YouTube-BB
- ImageNet-Vid
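As a rough illustration of how a zero-shot number on one of these benchmarks is computed, the sketch below scores CIFAR10 images against prompt-templated class names. Loading the data via torchvision, the prompt template, and the small evaluation subset are simplifying assumptions, not the paper's exact pipeline.

```python
import torch
from torchvision.datasets import CIFAR10
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

# CIFAR10 test split; images come back as PIL images, which the processor accepts.
dataset = CIFAR10(root="./data", train=False, download=True)
prompts = [f"a photo of a {name}" for name in dataset.classes]

correct = 0
n_eval = 100  # small illustrative subset; a full evaluation uses every test image
for i in range(n_eval):
    image, label = dataset[i]
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        pred = model(**inputs).logits_per_image.argmax(dim=1).item()
    correct += int(pred == label)

print(f"zero-shot top-1 accuracy on {n_eval} CIFAR10 images: {correct / n_eval:.2%}")
```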
Limitations
- CLIP struggles with tasks such as fine-grained classification and counting objects.
- There are fairness and bias issues, which depend on class design and the choice of categories.
- The use of linear probes to evaluate CLIP may underestimate its performance (see the linear-probe sketch after this list).
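For reference, a linear probe of the kind mentioned above fits a simple classifier on frozen CLIP image features. The sketch below uses scikit-learn's `LogisticRegression` and small CIFAR10 subsets as illustrative stand-ins for the paper's setup.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from torchvision.datasets import CIFAR10
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

def embed(dataset, n):
    # Encode the first n images into frozen CLIP image features (illustrative subset size).
    feats, labels = [], []
    for i in range(n):
        image, label = dataset[i]
        pixel_values = processor(images=image, return_tensors="pt").pixel_values
        with torch.no_grad():
            feats.append(model.get_image_features(pixel_values=pixel_values).squeeze(0).numpy())
        labels.append(label)
    return np.stack(feats), np.array(labels)

train_x, train_y = embed(CIFAR10(root="./data", train=True, download=True), 500)
test_x, test_y = embed(CIFAR10(root="./data", train=False, download=True), 200)

# The linear probe: a logistic-regression classifier on top of the frozen features.
probe = LogisticRegression(max_iter=1000).fit(train_x, train_y)
print("linear-probe accuracy:", probe.score(test_x, test_y))
```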
Bias and Fairness
- The performance of CLIP, and the biases it exhibits, can depend significantly on class design. Testing with images from the Fairface dataset showed significant disparities across race and gender.
- For gender classification on the Fairface dataset, accuracy was above 96% across all races, with 'Middle Eastern' the highest (98.4%) and 'White' the lowest (96.5%). CLIP averaged roughly 93% accuracy for racial classification and roughly 63% for age classification.
💬 Feedback
Where to send questions or comments about the model: please use this Google Form.