I-JEPA Model (Huge, pre-trained on IN22K)
I-JEPA is a self-supervised learning method. It predicts the representations of parts of an image from those of other parts of the same image, without relying on pre-specified invariances to data transformations and without filling in pixel-level details, thus learning more semantically meaningful representations.

Quick Start
I-JEPA is a self-supervised learning method that predicts the representations of unseen image regions from a visible context, working directly in latent space and avoiding both hand-crafted data augmentations and pixel-level reconstruction.

Features
- Self-supervised learning: Predicts image representations without relying on pre-specified invariances or pixel-level reconstruction.
- Semantic prediction: Focuses on high-level semantic information rather than pixel-level details.
- Applicable to multiple tasks: Can be used for image classification and feature extraction.
Documentation
How does it work?
Unlike generative methods that rely on a pixel-level decoder, I-JEPA uses a predictor that operates in latent space. This predictor can be regarded as a primitive world-model that, given a partial context, handles spatial uncertainty in a static image and predicts high-level information about the unseen regions rather than their pixel values.
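To make this latent-space objective concrete, here is a minimal, self-contained sketch. It is not the actual I-JEPA implementation: the ViT context encoder, the EMA target encoder, the predictor with its mask tokens, and the multi-block masking strategy are all replaced here by toy linear layers and hardcoded block indices, so it only illustrates the shape of the training signal.

```python
# Minimal sketch of the latent-space prediction objective (assumption: toy
# linear modules and hardcoded block indices, not the real I-JEPA modules).
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 64
context_encoder = nn.Linear(dim, dim)   # stands in for the context ViT encoder
target_encoder = nn.Linear(dim, dim)    # stands in for the EMA target encoder
predictor = nn.Linear(dim, dim)         # predicts target representations in latent space

patches = torch.randn(1, 16, dim)       # a toy "image" as 16 patch embeddings
context_idx = [0, 1, 2, 3]              # visible context block
target_idx = [10, 11]                   # masked target block to be predicted

# Encode the visible context; encode the target block without gradients
# (in I-JEPA the target encoder is an exponential moving average of the context encoder).
ctx = context_encoder(patches[:, context_idx])
with torch.no_grad():
    tgt = target_encoder(patches[:, target_idx])

# Predict the target-block representations from the context and compare them
# directly in latent space: no pixels are ever reconstructed.
pred = predictor(ctx.mean(dim=1, keepdim=True)).expand_as(tgt)
loss = F.mse_loss(pred, tgt)
loss.backward()
print(loss.item())
```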
We trained a stochastic decoder to map the predicted representations back to pixel space as sketches. The resulting sketches show that the model captures positional uncertainty and generates high-level object parts in the correct pose.

Intended uses & limitations
I-JEPA can be used for image classification or feature extraction. This specific checkpoint is intended for feature extraction.
Usage Examples
Basic Usage
Here is how to use this model for image feature extraction:
```python
import requests
import torch
from PIL import Image
from torch.nn.functional import cosine_similarity
from transformers import AutoModel, AutoProcessor

url_1 = "http://images.cocodataset.org/val2017/000000039769.jpg"
url_2 = "http://images.cocodataset.org/val2017/000000219578.jpg"

image_1 = Image.open(requests.get(url_1, stream=True).raw)
image_2 = Image.open(requests.get(url_2, stream=True).raw)

model_id = "jmtzt/ijepa_vith14_22k"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)


@torch.no_grad()
def infer(image):
    """Return one embedding per image by mean-pooling the patch representations."""
    inputs = processor(image, return_tensors="pt")
    outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1)


embed_1 = infer(image_1)
embed_2 = infer(image_2)

similarity = cosine_similarity(embed_1, embed_2)
print(similarity)
```
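Beyond pairwise similarity, the same pooled embeddings can feed a downstream classifier, which is how the image-classification use mentioned under intended uses would typically be realized. The sketch below is a minimal linear probe under stated assumptions: `train_images`, `train_labels`, the training hyperparameters, and the reuse of `infer` from the example above are illustrative placeholders, not part of this checkpoint or its API.

```python
import torch
import torch.nn as nn

# Hypothetical labelled data: `train_images` and `train_labels` are placeholders
# for your own dataset; `infer` is the function defined in the example above.
train_embeds = torch.cat([infer(img) for img in train_images])  # (N, hidden_size)
labels = torch.tensor(train_labels)                             # (N,)

num_classes = int(labels.max()) + 1
probe = nn.Linear(train_embeds.shape[-1], num_classes)  # linear head on frozen features
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Train only the linear head; the I-JEPA backbone stays frozen.
for _ in range(100):
    optimizer.zero_grad()
    loss = criterion(probe(train_embeds), labels)
    loss.backward()
    optimizer.step()

# Classify a new image with the frozen backbone plus the trained head.
prediction = probe(infer(image_1)).argmax(dim=-1)
print(prediction)
```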
BibTeX entry and citation info
If you use I-JEPA or this code in your work, please cite:
@article{assran2023self,
  title={Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture},
  author={Assran, Mahmoud and Duval, Quentin and Misra, Ishan and Bojanowski, Piotr and Vincent, Pascal and Rabbat, Michael and LeCun, Yann and Ballas, Nicolas},
  journal={arXiv preprint arXiv:2301.08243},
  year={2023}
}
License
This project is licensed under the cc-by-nc-4.0 license.
Information Table

| Property | Details |
|----------|---------|
| Datasets | timm/imagenet-22k-wds |
| Library Name | transformers |
| License | cc-by-nc-4.0 |