🚀 Model Card: CLIP
The CLIP model was developed by OpenAI researchers to study what contributes to robustness in computer vision tasks and to test the ability of models to generalize to arbitrary image classification tasks in a zero-shot manner.
🚀 Quick Start
The CLIP model is not intended for general deployment; researchers need to study its capabilities in their specific context before deploying it. You can use it with the transformers library as follows:
```python
from PIL import Image
import requests
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities
```
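Here `probs` has shape `(1, 2)`: one probability per candidate caption for the single input image, so the highest value is CLIP's zero-shot prediction.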
✨ Features
- The CLIP model helps researchers understand what contributes to robustness in computer vision tasks.
- It can test the zero-shot generalization ability of models on arbitrary image classification tasks (see the prompt-templating sketch after this list).
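As a concrete illustration of the second point, the sketch below turns an arbitrary, user-chosen label list into text prompts and picks the highest-scoring label. The label list and the `"a photo of a {label}"` template are illustrative assumptions, not part of the original card.

```python
from PIL import Image
import requests
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

# Arbitrary, user-chosen labels (illustrative): any list of strings works.
labels = ["cat", "dog", "remote control", "sofa"]
prompts = [f"a photo of a {label}" for label in labels]

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=1)

# Report the most likely label for the image.
print(labels[probs.argmax(dim=1).item()])
```

Because the labels are just strings, the same code applies to any classification task without retraining.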
📚 Documentation
Model Details
The CLIP model was developed by OpenAI researchers in January 2021.
| Property | Details |
|----------|---------|
| Model Type | The base model uses a ViT-B/16 Transformer architecture as an image encoder and a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss (see the loss sketch below the table). The original implementation had two variants: one using a ResNet image encoder and the other using a Vision Transformer. This repository has the variant with the Vision Transformer. |
| Model Date | January 2021 |
| Documents | Blog Post, CLIP Paper |
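To make the contrastive objective in the table concrete, here is a minimal PyTorch sketch of a CLIP-style loss over a batch of paired image and text embeddings. The function name, tensor shapes, and the explicit `logit_scale` argument are illustrative assumptions; the symmetric cross-entropy form follows the CLIP paper.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, logit_scale):
    # image_embeds, text_embeds: (batch, dim) outputs of the two encoders (illustrative shapes).
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # Pairwise cosine similarities, scaled by a learnable temperature (logit_scale).
    logits_per_image = logit_scale * image_embeds @ text_embeds.t()
    logits_per_text = logits_per_image.t()

    # Matching pairs lie on the diagonal; symmetric cross-entropy over both directions.
    targets = torch.arange(image_embeds.size(0), device=image_embeds.device)
    return (F.cross_entropy(logits_per_image, targets) +
            F.cross_entropy(logits_per_text, targets)) / 2
```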
Use with Transformers
You can use the CLIP model with the transformers library as shown in the code example above.
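Beyond the paired forward pass shown above, `CLIPModel` also exposes `get_image_features` and `get_text_features` for extracting embeddings separately. The cosine-similarity comparison below is a minimal sketch of how those embeddings can be used.

```python
import torch
from PIL import Image
import requests
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

text_inputs = processor(text=["a photo of a cat", "a photo of a dog"], return_tensors="pt", padding=True)
image_inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    text_embeds = model.get_text_features(**text_inputs)
    image_embeds = model.get_image_features(**image_inputs)

# Cosine similarity between the image and each caption.
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
print(image_embeds @ text_embeds.t())
```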
Intended Use
The model is intended as a research output for research communities. It aims to help researchers better understand and explore zero-shot, arbitrary image classification, and it can be used for interdisciplinary studies of its potential impacts. The primary intended users are AI researchers.
Out-of-Scope Use Cases
- Any deployed use case of the model (commercial or not) is currently out of scope. Non-deployed use cases, such as image search in a constrained environment, are also not recommended without thorough in-domain testing.
- Use cases in the domains of surveillance and facial recognition are always out of scope, given the current lack of testing norms and checks to ensure fair use.
- Since the model has not been trained or evaluated on non-English languages, its use should be limited to English-language use cases.
Data
The model was trained on publicly available image-caption data, gathered by crawling a number of websites and combining pre-existing datasets such as YFCC100M. As a result, the data is more representative of people and societies most connected to the internet, which skews toward developed nations and toward younger, male users.
Data Mission Statement
The goal of building this dataset was to test robustness and generalizability in computer vision tasks. The data was gathered in a mostly non-interventionist manner, and crawling was limited to websites with policies against excessively violent and adult images. The dataset is not intended for commercial or deployed models and will not be released.
Performance
The performance of CLIP was evaluated on a wide range of benchmarks across various computer vision datasets (a zero-shot evaluation sketch follows the list), including:
- Food101
- CIFAR10
- CIFAR100
- Birdsnap
- SUN397
- Stanford Cars
- FGVC Aircraft
- VOC2007
- DTD
- Oxford-IIIT Pet dataset
- Caltech101
- Flowers102
- MNIST
- SVHN
- IIIT5K
- Hateful Memes
- SST-2
- UCF101
- Kinetics700
- Country211
- CLEVR Counting
- KITTI Distance
- STL-10
- RareAct
- Flickr30
- MSCOCO
- ImageNet
- ImageNet-A
- ImageNet-R
- ImageNet Sketch
- ObjectNet (ImageNet Overlap)
- YouTube-BB
- ImageNet-Vid
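As a rough illustration of how a zero-shot number on one of these benchmarks is computed, the sketch below scores CIFAR10 images against prompt-templated class names. Loading the data via torchvision, the prompt template, and the small evaluation subset are simplifying assumptions, not the paper's exact pipeline.

```python
import torch
from torchvision.datasets import CIFAR10
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

# CIFAR10 test split; images come back as PIL images, which the processor accepts.
dataset = CIFAR10(root="./data", train=False, download=True)
prompts = [f"a photo of a {name}" for name in dataset.classes]

correct = 0
n_eval = 100  # small illustrative subset; a full evaluation uses every test image
for i in range(n_eval):
    image, label = dataset[i]
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        pred = model(**inputs).logits_per_image.argmax(dim=1).item()
    correct += int(pred == label)

print(f"zero-shot top-1 accuracy on {n_eval} CIFAR10 images: {correct / n_eval:.2%}")
```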
Limitations
- CLIP struggles with tasks such as fine-grained classification and counting objects.
- There are fairness and bias issues, which depend on class design and the choice of categories.
- The use of linear probes to evaluate CLIP may underestimate its performance (see the linear-probe sketch after this list).
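For reference, a linear probe of the kind mentioned above fits a simple classifier on frozen CLIP image features. The sketch below uses scikit-learn's `LogisticRegression` and small CIFAR10 subsets as illustrative stand-ins for the paper's setup.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from torchvision.datasets import CIFAR10
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

def embed(dataset, n):
    # Encode the first n images into frozen CLIP image features (illustrative subset size).
    feats, labels = [], []
    for i in range(n):
        image, label = dataset[i]
        pixel_values = processor(images=image, return_tensors="pt").pixel_values
        with torch.no_grad():
            feats.append(model.get_image_features(pixel_values=pixel_values).squeeze(0).numpy())
        labels.append(label)
    return np.stack(feats), np.array(labels)

train_x, train_y = embed(CIFAR10(root="./data", train=True, download=True), 500)
test_x, test_y = embed(CIFAR10(root="./data", train=False, download=True), 200)

# The linear probe: a logistic-regression classifier on top of the frozen features.
probe = LogisticRegression(max_iter=1000).fit(train_x, train_y)
print("linear-probe accuracy:", probe.score(test_x, test_y))
```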
Bias and Fairness
- The performance of CLIP, and the biases it exhibits, can depend significantly on class design. Testing with images from the Fairface dataset showed significant disparities across race and gender.
- For gender classification on the Fairface dataset, accuracy was above 96% across all races, with 'Middle Eastern' the highest (98.4%) and 'White' the lowest (96.5%). CLIP averaged roughly 93% accuracy for racial classification and roughly 63% for age classification.
💬 Feedback
Where to send questions or comments about the model: please use this Google Form.