🚀 Model Card: CLIP
The CLIP model was developed by researchers at OpenAI to learn about what contributes to robustness in computer vision tasks and to test the ability of models to generalize to arbitrary image classification in a zero-shot manner. It is not intended for general deployment; researchers first need to study its capabilities in relation to the specific context it would be deployed within.
🚀 Quick Start
Use with Transformers
```python
from PIL import Image
import requests

from transformers import CLIPProcessor, CLIPModel

# Load the pretrained checkpoint and its paired processor.
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Fetch an example image from the COCO validation set.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Prepare the candidate captions and the image for the model.
inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # image-text similarity scores
probs = logits_per_image.softmax(dim=1)      # label probabilities for the image
```
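After running this, `probs` holds, for the single input image, a probability over the two candidate captions, so the highest entry indicates which prompt CLIP considers the best match.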
✨ Features
- Zero-shot Generalization: Can classify images against arbitrary label sets without task-specific training.
- Contrastive Learning: The image and text encoders are trained to maximize the similarity of matching (image, text) pairs via a contrastive loss (illustrated in the sketch below).
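As a rough illustration of that contrastive setup at inference time, the sketch below embeds an image and two candidate captions with the two encoders, L2-normalizes the embeddings, and scales their cosine similarities by the learned temperature (`logit_scale`). It reuses the checkpoint from the Quick Start and approximates what `model(**inputs)` computes internally; the image URL and captions are only placeholders.

```python
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
texts = ["a photo of a cat", "a photo of a dog"]
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    # Project image and text into the shared embedding space.
    image_embeds = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_embeds = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])

    # L2-normalize so the dot product is a cosine similarity, then apply the
    # learned temperature, as in the contrastive training objective.
    image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
    text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
    logits = model.logit_scale.exp() * image_embeds @ text_embeds.T

print(logits.softmax(dim=-1))  # probabilities over the candidate captions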
📦 Installation
The original model card does not include installation steps; the Quick Start above assumes the 🤗 Transformers library (plus Pillow and Requests) is already installed, for example via pip.
📚 Documentation
Model Details
- Model Date: January 2021
| Property | Details |
| --- | --- |
| Model Type | The base model uses a ViT-L/14 Transformer architecture as an image encoder and a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The original implementation had two variants: one with a ResNet image encoder and the other with a Vision Transformer; this repository contains the Vision Transformer variant (see the configuration sketch below). |
| Training Data | The model was trained on publicly available image-caption data, gathered by crawling websites and using pre-existing datasets such as YFCC100M. A large part of the data comes from crawling the internet. |
- Documents:
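To check which encoder variant a downloaded checkpoint contains, its configuration can be inspected. A minimal sketch, relying on the 🤗 Transformers `CLIPConfig` attributes (`vision_config`, `text_config`, `projection_dim`):

```python
from transformers import CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")

vision_cfg = model.config.vision_config  # Vision Transformer image encoder
text_cfg = model.config.text_config      # masked self-attention text encoder

# For the ViT-L/14 variant this should report 14-pixel patches on 224x224 inputs.
print("patch size:", vision_cfg.patch_size, "image size:", vision_cfg.image_size)
print("vision layers:", vision_cfg.num_hidden_layers, "text layers:", text_cfg.num_hidden_layers)
print("shared embedding dim:", model.config.projection_dim)
```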
Model Use
Intended Use
The primary intended users of the model are AI researchers. It is mainly intended to help researchers better understand the robustness, generalization, and other capabilities, biases, and constraints of computer vision models.
Out-of-Scope Use Cases
- Deployment: Any deployed use case of the model, whether commercial or not, is currently out of scope. Non-deployed use cases, such as image search in a constrained environment, are also not recommended without thorough in-domain testing.
- Surveillance and Facial Recognition: Always out of scope, given the lack of testing norms and checks to ensure fair use.
- Non-English Use: Since the model has not been trained or evaluated on languages other than English, its use should be limited to English-language use cases.
Data
The model was trained on publicly available image-caption data, gathered from a variety of internet sources with a focus on quantity; the goal was to test robustness and generalizability in computer vision tasks. Only websites with policies against violent and adult images were crawled. The dataset will not be released and is not intended for commercial or deployed models.
Performance and Limitations
Performance
The performance of CLIP has been evaluated on a wide range of benchmarks across various computer vision datasets, including the following (a zero-shot evaluation sketch follows the list):
- Food101
- CIFAR10
- CIFAR100
- Birdsnap
- SUN397
- Stanford Cars
- FGVC Aircraft
- VOC2007
- DTD
- Oxford-IIIT Pet dataset
- Caltech101
- Flowers102
- MNIST
- SVHN
- IIIT5K
- Hateful Memes
- SST-2
- UCF101
- Kinetics700
- Country211
- CLEVR Counting
- KITTI Distance
- STL-10
- RareAct
- Flickr30
- MSCOCO
- ImageNet
- ImageNet-A
- ImageNet-R
- ImageNet Sketch
- ObjectNet (ImageNet Overlap)
- YouTube-BB
- ImageNet-Vid
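For a sense of how such zero-shot evaluations are run, the sketch below scores a small CIFAR10 subset by turning its class names into text prompts and picking the most similar prompt per image. The torchvision loader, the prompt template, and the tiny subset size are choices of this sketch, not the exact protocol behind the reported results.

```python
import torch
from torchvision.datasets import CIFAR10
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# CIFAR10 test split; class names become zero-shot "labels" via prompts.
dataset = CIFAR10(root="./data", train=False, download=True)
prompts = [f"a photo of a {name}" for name in dataset.classes]

correct = 0
n_samples = 100  # small subset to keep the sketch fast
for i in range(n_samples):
    image, label = dataset[i]  # PIL image, integer label
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape (1, num_classes)
    correct += int(logits.argmax(dim=-1).item() == label)

print(f"zero-shot accuracy on {n_samples} CIFAR10 images: {correct / n_samples:.2%}")
```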
Limitations
- Task Difficulty: CLIP struggles with fine-grained classification and with counting objects in an image.
- Bias and Fairness: Performance and biases depend significantly on class design. Significant disparities were found in race and gender classification when testing with the Fairface dataset.
- Testing Approach: Linear probes were used in much of the evaluation, and they may underestimate model performance (a linear-probe sketch follows this list).
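For context on that linear-probe protocol, here is a minimal sketch: freeze CLIP, extract image features, and fit a single linear classifier on top. The scikit-learn logistic regression and the small CIFAR10 subsets are assumptions of this sketch, not the paper's exact setup.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from torchvision.datasets import CIFAR10
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def encode(train, n):
    """Return frozen CLIP image features and labels for the first n examples."""
    data = CIFAR10(root="./data", train=train, download=True)
    feats, labels = [], []
    for i in range(n):
        image, label = data[i]
        pixel_values = processor(images=image, return_tensors="pt")["pixel_values"]
        with torch.no_grad():
            feats.append(model.get_image_features(pixel_values=pixel_values)[0].numpy())
        labels.append(label)
    return np.stack(feats), np.array(labels)

X_train, y_train = encode(train=True, n=500)   # tiny subsets keep the sketch quick
X_test, y_test = encode(train=False, n=200)

# The "linear probe": a single linear classifier on top of frozen features.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("linear-probe accuracy:", probe.score(X_test, y_test))
```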
Feedback
Please use this Google Form to send questions or comments about the model.
⚠️ Important Note
The model card is taken and modified from the official CLIP repository, where the original can be found.
💡 Usage Tip
To deploy models like CLIP, researchers first need to carefully study their capabilities in relation to the specific context they are being deployed within. Also, since the model has mainly been trained on English-language data, its use should be limited to English use cases.