Model Card: CLIP
The CLIP model was developed by OpenAI researchers to explore what contributes to robustness in computer vision tasks and to test models' ability to generalize zero-shot to arbitrary image classification. It is not intended for general deployment; researchers first need to study its capabilities in the specific context in which they plan to deploy it.
Quick Start
Use with Transformers
```python
from PIL import Image
import requests
from transformers import CLIPProcessor, CLIPModel

# Load the ViT-B/32 CLIP checkpoint and its processor
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Download an example image from the COCO validation set
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Score the image against two candidate captions
inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities
```
Features
- Developed to study robustness in computer vision tasks and zero-shot generalization for image classification.
- Uses a ViT-B/32 Transformer as an image encoder and a masked self-attention Transformer as a text encoder.
- Trained to maximize the similarity of (image, text) pairs via a contrastive loss.
Installation
No installation steps were provided in the original document. The Quick Start above assumes that the `transformers`, `torch`, `Pillow`, and `requests` packages are installed.
Documentation
Model Details
Model Date
January 2021
Model Type
The model uses a ViT-B/32 Transformer architecture as an image encoder and a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The original implementation had two variants: one using a ResNet image encoder and the other a Vision Transformer. This repository contains the Vision Transformer variant.
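As a minimal sketch of how the two encoders can be used separately through the `transformers` API, `get_image_features` and `get_text_features` return the projected embeddings whose similarity the model was trained to maximize. The checkpoint and example image are reused from the Quick Start above; the cosine-similarity computation at the end is only illustrative.

```python
import torch
import requests
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(text=["a photo of a cat"], images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    # ViT-B/32 image encoder -> projected image embedding
    image_embeds = model.get_image_features(pixel_values=inputs["pixel_values"])
    # Masked self-attention Transformer text encoder -> projected text embedding
    text_embeds = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])

# Cosine similarity between the two embeddings (the model's logits_per_image
# additionally scales this by a learned temperature)
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
print((image_embeds @ text_embeds.T).item())
```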
Documents
- [Blog Post Announcing CLIP](https://openai.com/blog/clip/)
- [CLIP Paper](https://arxiv.org/abs/2103.00020)
Model Use
Intended Use
The model is intended for research communities. It aims to help researchers better understand zero-shot, arbitrary image classification and to support interdisciplinary studies of its potential impacts. The primary intended users are AI researchers, who can use it to study the capabilities, biases, and constraints of computer vision models.
Out-of-Scope Use Cases
- Deployment: Any deployed use case of the model, whether commercial or not, is out of scope. Non-deployed use cases such as image search in a constrained environment are also not recommended without thorough in-domain testing with a specific, fixed class taxonomy.
- Surveillance and Facial Recognition: Use in surveillance and facial recognition is always out of scope, given the lack of testing norms to ensure fair use.
- Language Limitation: Since the model was not trained or evaluated on languages other than English, its use should be limited to English-language use cases.
Data
The model was trained on publicly available image-caption data, collected by crawling a number of websites and using pre-existing datasets such as [YFCC100M](http://projects.dfki.uni-kl.de/yfcc100m/). Because most of the data comes from internet crawling, it may be skewed towards more developed nations and towards younger, male users.
Data Mission Statement
The goal in building the dataset was to test robustness and generalizability in computer vision tasks. Data was gathered non-interventionally from websites whose policies prohibit violent and adult images. The dataset is not intended for commercial or deployed models and will not be released.
Performance and Limitations
Performance
CLIP's performance was evaluated on a wide range of benchmarks across various computer vision datasets, including Food101, CIFAR10, and CIFAR100, among others.
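As a rough illustration of how such a zero-shot evaluation can be run, the sketch below scores a small subset of CIFAR10 against prompt-style class names. The `"a photo of a {label}"` template and the tiny evaluation subset are illustrative choices, not the exact protocol used for the reported benchmarks.

```python
import torch
from torchvision.datasets import CIFAR10
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

dataset = CIFAR10(root="./data", train=False, download=True)
# One text prompt per class, in the same order as dataset.classes
prompts = [f"a photo of a {label}" for label in dataset.classes]

correct = 0
n_samples = 100  # small subset to keep the sketch fast
for i in range(n_samples):
    image, target = dataset[i]
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits_per_image = model(**inputs).logits_per_image
    correct += int(logits_per_image.argmax(dim=-1).item() == target)

print(f"zero-shot accuracy on {n_samples} CIFAR10 images: {correct / n_samples:.2%}")
```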
Limitations
- Task Difficulty: CLIP struggles with fine-grained classification and with counting the number of objects in an image.
- Bias and Fairness: The model raises fairness and bias concerns, which depend heavily on class design; disparities with respect to race and gender were found when classifying images from Fairface.
- Testing Approach: Evaluating CLIP with linear probes may underestimate its performance (a linear-probe sketch is given below).
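For the linear-probe point above, the following sketch shows one common way such an evaluation is set up: freeze CLIP, extract image embeddings, and fit a logistic-regression classifier on top. The dataset, subset sizes, and solver settings are illustrative assumptions, not the authors' evaluation protocol.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from torchvision.datasets import CIFAR10
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(dataset, n):
    """Return (features, labels) for the first n images using the frozen image encoder."""
    feats, labels = [], []
    for i in range(n):
        image, target = dataset[i]
        inputs = processor(images=image, return_tensors="pt")
        with torch.no_grad():
            feats.append(model.get_image_features(**inputs).squeeze(0).numpy())
        labels.append(target)
    return np.stack(feats), np.array(labels)

train_x, train_y = embed(CIFAR10("./data", train=True, download=True), n=500)
test_x, test_y = embed(CIFAR10("./data", train=False, download=True), n=200)

# Only the linear classifier is trained; CLIP itself stays frozen
probe = LogisticRegression(max_iter=1000)
probe.fit(train_x, train_y)
print(f"linear-probe accuracy: {probe.score(test_x, test_y):.2%}")
```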
Bias and Fairness
The performance and biases of CLIP depend on class design. Tests on Fairface images showed significant race and gender disparities, which can change based on class construction. CLIP achieved >96% accuracy for gender classification across races, ~93% for racial classification, and ~63% for age classification.
Technical Details
The model uses a ViT-B/32 Transformer architecture as an image encoder and a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss.
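As an illustration of this training objective, the sketch below computes a symmetric contrastive (InfoNCE-style) loss over a batch of matching (image, text) embedding pairs. It mirrors the loss described above but is not the original training code; the embedding dimension, batch size, and temperature value are only example choices.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds: torch.Tensor, text_embeds: torch.Tensor,
                          logit_scale: torch.Tensor) -> torch.Tensor:
    # Normalize both sets of embeddings so the dot product is a cosine similarity.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # N x N similarity matrix, scaled by the learned temperature.
    logits_per_image = logit_scale * image_embeds @ text_embeds.T
    logits_per_text = logits_per_image.T

    # The i-th image matches the i-th text, so the targets are the diagonal indices.
    targets = torch.arange(image_embeds.shape[0])
    loss_images = F.cross_entropy(logits_per_image, targets)
    loss_texts = F.cross_entropy(logits_per_text, targets)
    return (loss_images + loss_texts) / 2

# Toy usage with random embeddings and an example temperature of 1/0.07.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512),
                             logit_scale=torch.tensor(1 / 0.07))
print(loss.item())
```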
License
No license information was provided in the original document.
Feedback
Where to send questions or comments about the model
Please use this Google Form