Model Card: CLIP
The CLIP model was developed by OpenAI researchers to explore what contributes to robustness in computer vision tasks and to test models' ability to generalize zero-shot to arbitrary image classification. It is not intended for general deployment; researchers first need to study its capabilities in the specific context in which they plan to deploy it.
Quick Start
Use with Transformers
```python
from PIL import Image
import requests
from transformers import CLIPProcessor, CLIPModel

# Load the ViT-B/32 CLIP checkpoint and its processor
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Download an example image from the COCO validation set
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Score the image against two candidate captions
inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities
```
Features
- Developed to study robustness in computer vision tasks and zero-shot generalization for image classification.
- Uses a ViT-B/32 Transformer as an image encoder and a masked self-attention Transformer as a text encoder.
- Trained to maximize the similarity of (image, text) pairs via a contrastive loss.
Installation
No installation steps were provided in the original document. The Quick Start above assumes that the `transformers`, `torch`, `Pillow`, and `requests` packages are installed.
Documentation
Model Details
Model Date
January 2021
Model Type
The model uses a ViT-B/32 Transformer architecture as an image encoder and a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The original implementation had two variants: one using a ResNet image encoder and the other a Vision Transformer. This repository contains the Vision Transformer variant.
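As a minimal sketch of how the two encoders can be used separately through the `transformers` API, `get_image_features` and `get_text_features` return the projected embeddings whose similarity the model was trained to maximize. The checkpoint and example image are reused from the Quick Start above; the cosine-similarity computation at the end is only illustrative.

```python
import torch
import requests
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(text=["a photo of a cat"], images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    # ViT-B/32 image encoder -> projected image embedding
    image_embeds = model.get_image_features(pixel_values=inputs["pixel_values"])
    # Masked self-attention Transformer text encoder -> projected text embedding
    text_embeds = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])

# Cosine similarity between the two embeddings (the model's logits_per_image
# additionally scales this by a learned temperature)
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
print((image_embeds @ text_embeds.T).item())
```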
Documents
- [Blog Post Announcing CLIP](https://openai.com/blog/clip/)
- [CLIP Paper](https://arxiv.org/abs/2103.00020)
Model Use
Intended Use
The model is intended for research communities. It aims to help researchers better understand zero-shot, arbitrary image classification and to support interdisciplinary studies of its potential impacts. The primary intended users are AI researchers, who can use it to study the capabilities, biases, and constraints of computer vision models.
Out-of-Scope Use Cases
- Deployment: Any deployed use case of the model, whether commercial or not, is out of scope. Non-deployed use cases such as image search in a constrained environment are also not recommended without thorough in-domain testing with a specific, fixed class taxonomy.
- Surveillance and Facial Recognition: Use in surveillance and facial recognition is always out of scope, given the lack of testing norms to ensure fair use.
- Language Limitation: Since the model was not trained or evaluated on languages other than English, its use should be limited to English-language use cases.
Data
The model was trained on publicly available image-caption data, collected by crawling a number of websites and using pre-existing datasets such as [YFCC100M](http://projects.dfki.uni-kl.de/yfcc100m/). Because most of the data comes from internet crawling, it may be skewed towards more developed nations and towards younger, male users.
Data Mission Statement
The goal in building the dataset was to test robustness and generalizability in computer vision tasks. Data was gathered non-interventionally from websites whose policies prohibit violent and adult images. The dataset is not intended for commercial or deployed models and will not be released.
Performance and Limitations
Performance
CLIP's performance was evaluated on a wide range of benchmarks across various computer vision datasets, including Food101, CIFAR10, and CIFAR100, among others.
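As a rough illustration of how such a zero-shot evaluation can be run, the sketch below scores a small subset of CIFAR10 against prompt-style class names. The `"a photo of a {label}"` template and the tiny evaluation subset are illustrative choices, not the exact protocol used for the reported benchmarks.

```python
import torch
from torchvision.datasets import CIFAR10
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

dataset = CIFAR10(root="./data", train=False, download=True)
# One text prompt per class, in the same order as dataset.classes
prompts = [f"a photo of a {label}" for label in dataset.classes]

correct = 0
n_samples = 100  # small subset to keep the sketch fast
for i in range(n_samples):
    image, target = dataset[i]
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits_per_image = model(**inputs).logits_per_image
    correct += int(logits_per_image.argmax(dim=-1).item() == target)

print(f"zero-shot accuracy on {n_samples} CIFAR10 images: {correct / n_samples:.2%}")
```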
Limitations
- Task Difficulty: CLIP struggles with fine-grained classification and with counting the number of objects in an image.
- Bias and Fairness: The model raises fairness and bias concerns, which depend heavily on class design; disparities with respect to race and gender were found when classifying images from Fairface.
- Testing Approach: Evaluating CLIP with linear probes may underestimate its performance (a linear-probe sketch is given below).
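For the linear-probe point above, the following sketch shows one common way such an evaluation is set up: freeze CLIP, extract image embeddings, and fit a logistic-regression classifier on top. The dataset, subset sizes, and solver settings are illustrative assumptions, not the authors' evaluation protocol.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from torchvision.datasets import CIFAR10
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(dataset, n):
    """Return (features, labels) for the first n images using the frozen image encoder."""
    feats, labels = [], []
    for i in range(n):
        image, target = dataset[i]
        inputs = processor(images=image, return_tensors="pt")
        with torch.no_grad():
            feats.append(model.get_image_features(**inputs).squeeze(0).numpy())
        labels.append(target)
    return np.stack(feats), np.array(labels)

train_x, train_y = embed(CIFAR10("./data", train=True, download=True), n=500)
test_x, test_y = embed(CIFAR10("./data", train=False, download=True), n=200)

# Only the linear classifier is trained; CLIP itself stays frozen
probe = LogisticRegression(max_iter=1000)
probe.fit(train_x, train_y)
print(f"linear-probe accuracy: {probe.score(test_x, test_y):.2%}")
```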
Bias and Fairness
The performance and biases of CLIP depend on class design. Tests on Fairface images showed significant race and gender disparities, which can change based on class construction. CLIP achieved >96% accuracy for gender classification across races, ~93% for racial classification, and ~63% for age classification.
Technical Details
The model uses a ViT-B/32 Transformer architecture as an image encoder and a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss.
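As an illustration of this training objective, the sketch below computes a symmetric contrastive (InfoNCE-style) loss over a batch of matching (image, text) embedding pairs. It mirrors the loss described above but is not the original training code; the embedding dimension, batch size, and temperature value are only example choices.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds: torch.Tensor, text_embeds: torch.Tensor,
                          logit_scale: torch.Tensor) -> torch.Tensor:
    # Normalize both sets of embeddings so the dot product is a cosine similarity.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # N x N similarity matrix, scaled by the learned temperature.
    logits_per_image = logit_scale * image_embeds @ text_embeds.T
    logits_per_text = logits_per_image.T

    # The i-th image matches the i-th text, so the targets are the diagonal indices.
    targets = torch.arange(image_embeds.shape[0])
    loss_images = F.cross_entropy(logits_per_image, targets)
    loss_texts = F.cross_entropy(logits_per_text, targets)
    return (loss_images + loss_texts) / 2

# Toy usage with random embeddings and an example temperature of 1/0.07.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512),
                             logit_scale=torch.tensor(1 / 0.07))
print(loss.item())
```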
License
No license information was provided in the original document.
Feedback
Where to send questions or comments about the model
Please use this Google Form