🚀 Model Card: CLIP
The CLIP model was developed by researchers at OpenAI to learn about what contributes to robustness in computer vision tasks and to test the ability of models to generalize to arbitrary image classification in a zero-shot manner. It is not intended for general deployment; researchers first need to study its capabilities in relation to the specific context it would be deployed within.
🚀 Quick Start
Use with Transformers
```python
from PIL import Image
import requests

from transformers import CLIPProcessor, CLIPModel

# Load the pretrained checkpoint and its paired processor.
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Fetch an example image from the COCO validation set.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Prepare the candidate captions and the image for the model.
inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # image-text similarity scores
probs = logits_per_image.softmax(dim=1)      # label probabilities for the image
```
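After running this, `probs` holds, for the single input image, a probability over the two candidate captions, so the highest entry indicates which prompt CLIP considers the best match.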
✨ Features
- Zero-shot Generalization: Can classify images against arbitrary label sets without task-specific training.
- Contrastive Learning: The image and text encoders are trained to maximize the similarity of matching (image, text) pairs via a contrastive loss (illustrated in the sketch below).
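As a rough illustration of that contrastive setup at inference time, the sketch below embeds an image and two candidate captions with the two encoders, L2-normalizes the embeddings, and scales their cosine similarities by the learned temperature (`logit_scale`). It reuses the checkpoint from the Quick Start and approximates what `model(**inputs)` computes internally; the image URL and captions are only placeholders.

```python
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
texts = ["a photo of a cat", "a photo of a dog"]
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    # Project image and text into the shared embedding space.
    image_embeds = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_embeds = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])

    # L2-normalize so the dot product is a cosine similarity, then apply the
    # learned temperature, as in the contrastive training objective.
    image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
    text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
    logits = model.logit_scale.exp() * image_embeds @ text_embeds.T

print(logits.softmax(dim=-1))  # probabilities over the candidate captions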
📦 Installation
The original model card does not include installation steps; the Quick Start above assumes the 🤗 Transformers library (plus Pillow and Requests) is already installed, for example via pip.
📚 Documentation
Model Details
- Model Date: January 2021
| Property | Details |
| --- | --- |
| Model Type | The base model uses a ViT-L/14 Transformer architecture as an image encoder and a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The original implementation had two variants: one with a ResNet image encoder and the other with a Vision Transformer; this repository contains the Vision Transformer variant (see the configuration sketch below). |
| Training Data | The model was trained on publicly available image-caption data, gathered by crawling websites and using pre-existing datasets such as YFCC100M. A large part of the data comes from crawling the internet. |
- Documents:
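To check which encoder variant a downloaded checkpoint contains, its configuration can be inspected. A minimal sketch, relying on the 🤗 Transformers `CLIPConfig` attributes (`vision_config`, `text_config`, `projection_dim`):

```python
from transformers import CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")

vision_cfg = model.config.vision_config  # Vision Transformer image encoder
text_cfg = model.config.text_config      # masked self-attention text encoder

# For the ViT-L/14 variant this should report 14-pixel patches on 224x224 inputs.
print("patch size:", vision_cfg.patch_size, "image size:", vision_cfg.image_size)
print("vision layers:", vision_cfg.num_hidden_layers, "text layers:", text_cfg.num_hidden_layers)
print("shared embedding dim:", model.config.projection_dim)
```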
Model Use
Intended Use
The primary intended users of the model are AI researchers. It is mainly intended to help researchers better understand the robustness, generalization, and other capabilities, biases, and constraints of computer vision models.
Out-of-Scope Use Cases
- Deployment: Any deployed use case of the model, whether commercial or not, is currently out of scope. Non-deployed use cases, such as image search in a constrained environment, are also not recommended without thorough in-domain testing.
- Surveillance and Facial Recognition: Always out of scope, given the lack of testing norms and checks to ensure fair use.
- Non-English Use: Since the model has not been trained or evaluated on languages other than English, its use should be limited to English-language use cases.
Data
The model was trained on publicly available image-caption data, gathered from a variety of internet sources with a focus on quantity; the goal was to test robustness and generalizability in computer vision tasks. Only websites with policies against violent and adult images were crawled. The dataset will not be released and is not intended for commercial or deployed models.
Performance and Limitations
Performance
The performance of CLIP has been evaluated on a wide range of benchmarks across various computer vision datasets, including the following (a zero-shot evaluation sketch follows the list):
- Food101
- CIFAR10
- CIFAR100
- Birdsnap
- SUN397
- Stanford Cars
- FGVC Aircraft
- VOC2007
- DTD
- Oxford-IIIT Pet dataset
- Caltech101
- Flowers102
- MNIST
- SVHN
- IIIT5K
- Hateful Memes
- SST-2
- UCF101
- Kinetics700
- Country211
- CLEVR Counting
- KITTI Distance
- STL-10
- RareAct
- Flickr30
- MSCOCO
- ImageNet
- ImageNet-A
- ImageNet-R
- ImageNet Sketch
- ObjectNet (ImageNet Overlap)
- YouTube-BB
- ImageNet-Vid
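For a sense of how such zero-shot evaluations are run, the sketch below scores a small CIFAR10 subset by turning its class names into text prompts and picking the most similar prompt per image. The torchvision loader, the prompt template, and the tiny subset size are choices of this sketch, not the exact protocol behind the reported results.

```python
import torch
from torchvision.datasets import CIFAR10
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# CIFAR10 test split; class names become zero-shot "labels" via prompts.
dataset = CIFAR10(root="./data", train=False, download=True)
prompts = [f"a photo of a {name}" for name in dataset.classes]

correct = 0
n_samples = 100  # small subset to keep the sketch fast
for i in range(n_samples):
    image, label = dataset[i]  # PIL image, integer label
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape (1, num_classes)
    correct += int(logits.argmax(dim=-1).item() == label)

print(f"zero-shot accuracy on {n_samples} CIFAR10 images: {correct / n_samples:.2%}")
```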
Limitations
- Task Difficulty: CLIP struggles with fine-grained classification and with counting objects in an image.
- Bias and Fairness: Performance and biases depend significantly on class design. Significant disparities were found in race and gender classification when testing with the Fairface dataset.
- Testing Approach: Linear probes were used in much of the evaluation, and they may underestimate model performance (a linear-probe sketch follows this list).
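For context on that linear-probe protocol, here is a minimal sketch: freeze CLIP, extract image features, and fit a single linear classifier on top. The scikit-learn logistic regression and the small CIFAR10 subsets are assumptions of this sketch, not the paper's exact setup.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from torchvision.datasets import CIFAR10
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def encode(train, n):
    """Return frozen CLIP image features and labels for the first n examples."""
    data = CIFAR10(root="./data", train=train, download=True)
    feats, labels = [], []
    for i in range(n):
        image, label = data[i]
        pixel_values = processor(images=image, return_tensors="pt")["pixel_values"]
        with torch.no_grad():
            feats.append(model.get_image_features(pixel_values=pixel_values)[0].numpy())
        labels.append(label)
    return np.stack(feats), np.array(labels)

X_train, y_train = encode(train=True, n=500)   # tiny subsets keep the sketch quick
X_test, y_test = encode(train=False, n=200)

# The "linear probe": a single linear classifier on top of frozen features.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("linear-probe accuracy:", probe.score(X_test, y_test))
```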
Feedback
Please use this Google Form to send questions or comments about the model.
⚠️ Important Note
The model card is taken and modified from the official CLIP repository, where the original can be found.
💡 Usage Tip
To deploy models like CLIP, researchers first need to carefully study their capabilities in relation to the specific context they are being deployed within. Also, since the model has mainly been trained on English-language data, its use should be limited to English use cases.