🚀 CLIP (OpenAI model for timm)
The CLIP model was developed by OpenAI researchers to explore robustness in computer vision tasks and to test zero-shot generalization for arbitrary image classification.
🚀 Quick Start
This instance of the CLIP model is intended for loading with the timm (https://github.com/rwightman/pytorch-image-models) and OpenCLIP (https://github.com/mlfoundations/open_clip) libraries.
Please see https://huggingface.co/openai/clip-vit-base-patch16 for use in Hugging Face Transformers.
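The snippet below is a minimal loading sketch. The checkpoint identifiers (`vit_base_patch16_clip_224.openai` for timm and `ViT-B-16` with the `openai` pretrained tag for OpenCLIP) are assumptions based on the libraries' usual naming for the OpenAI ViT-B/16 weights; they are not specified by this card.

```python
import timm
import open_clip

# timm: image tower only, usable as a feature extractor / backbone
image_encoder = timm.create_model(
    "vit_base_patch16_clip_224.openai",  # assumed timm name for these weights
    pretrained=True,
    num_classes=0,  # return pooled image features instead of classification logits
)
image_encoder.eval()

# OpenCLIP: full image + text model with matching preprocessing and tokenizer
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-16", pretrained="openai"  # assumed OpenCLIP tag for the OpenAI checkpoint
)
tokenizer = open_clip.get_tokenizer("ViT-B-16")
```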
✨ Features
The CLIP model was developed by researchers at OpenAI to learn about what contributes to robustness in computer vision tasks and to test the zero-shot generalization ability for arbitrary image classification tasks. It is not intended for general model deployment.
📚 Documentation
Model Details
The CLIP model was developed by OpenAI researchers to study factors contributing to robustness in computer vision tasks and to test zero-shot generalization for arbitrary image classification. It is not intended for general deployment.
Model Date
January 2021
Model Type
| Property | Details |
|----------|---------|
| Model Type | The model uses a ViT-B/16 Transformer architecture as an image encoder and a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The original implementation had two variants: one with a ResNet image encoder and one with a Vision Transformer; this repository contains the Vision Transformer variant (see the usage sketch after this table). |
| Training Data | The model was trained on publicly available image-caption data, gathered by crawling websites and by using pre-existing datasets such as [YFCC100M](http://projects.dfki.uni-kl.de/yfcc100m/). The data is skewed towards more developed nations and towards younger, male users. |
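As a usage illustration of the contrastive image-text objective described above, the sketch below scores an image against a few candidate captions with OpenCLIP. The image path and caption list are placeholders, and the model/pretrained identifiers are the same assumptions as in the Quick Start sketch.

```python
import torch
from PIL import Image
import open_clip

# Same assumed identifiers as in the Quick Start sketch
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-16", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-16")
model.eval()

image = preprocess(Image.open("example.jpg")).unsqueeze(0)   # placeholder image path
texts = tokenizer(["a photo of a cat", "a photo of a dog"])  # placeholder class prompts

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(texts)
    # Normalize so dot products are cosine similarities, mirroring the contrastive loss
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)  # probabilities over the candidate captions
```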
Documents
- [Blog Post](https://openai.com/blog/clip/)
- [CLIP Paper](https://arxiv.org/abs/2103.00020)
Model Use
Intended Use
The model is a research output for research communities. It aims to help researchers understand and explore zero-shot, arbitrary image classification and can be used for interdisciplinary studies of potential impacts.
Primary intended uses
The primary users are AI researchers. The model is mainly used to understand the robustness, generalization, capabilities, biases, and constraints of computer vision models.
Out-of-Scope Use Cases
⚠️ Important Note
Any deployed use case of the model, whether commercial or not, is currently out of scope. Non-deployed use cases, such as image search in a constrained environment, are also not recommended without thorough in-domain testing. Use cases related to surveillance and facial recognition are always out of scope. In addition, since the model was trained and evaluated only on English text, its use should be limited to English-language use cases.
Data
Data Mission Statement
The goal of building this dataset was to test robustness and generalizability in computer vision tasks. The data was gathered from a variety of public internet sources in a mostly non-interventionist way, filtering out violent and adult content. The dataset is not intended for commercial or deployed models and will not be released.
Limitations
CLIP and the analysis of it have limitations. CLIP struggles with tasks such as fine-grained classification and object counting, and it raises fairness and bias concerns. Additionally, evaluating CLIP with linear probes may underestimate its performance.
Bias and Fairness
💡 Usage Tip
The performance and biases of CLIP depend on class design. Testing with the FairFace dataset showed significant disparities across race and gender, and these disparities can shift depending on how the classes are constructed. CLIP achieved high accuracy in gender classification, ~93% accuracy in racial classification, and ~63% accuracy in age classification. These evaluations were run to assess performance and surface potential risks, not to endorse such use cases.