🚀 I-JEPA Model (Huge, fine-tuned on IN1K)
I-JEPA is a self-supervised learning method. It predicts the representations of part of an image from those of other parts of the same image. This approach has two key advantages:
- It doesn't rely on pre-specified invariances to hand-crafted data transformations, which can be biased for specific downstream tasks.
- It avoids having the model fill in pixel-level details, which often leads to learning less semantically meaningful representations.

🚀 Quick Start
✨ Features
- Self-supervised Learning: I-JEPA is a self-supervised learning method that predicts image representations without relying on pre-defined data transformation invariances or pixel-level filling.
- Latent Space Prediction: Instead of a pixel decoder like generative methods, I-JEPA has a predictor that makes predictions in latent space.
- Semantic World Model: The predictor in I-JEPA acts as a semantic world model, predicting high-level information about unseen image regions.
📦 Installation
Installation instructions are not included in the original model card; the usage example below assumes a working Hugging Face Transformers environment with `transformers`, `torch`, `Pillow`, and `requests` installed.
💻 Usage Examples
Basic Usage
Here is how to use this model for image feature extraction:
```python
import requests
import torch
from PIL import Image
from torch.nn.functional import cosine_similarity
from transformers import AutoModel, AutoProcessor

# Two sample images from the COCO validation set
url_1 = "http://images.cocodataset.org/val2017/000000039769.jpg"
url_2 = "http://images.cocodataset.org/val2017/000000219578.jpg"
image_1 = Image.open(requests.get(url_1, stream=True).raw)
image_2 = Image.open(requests.get(url_2, stream=True).raw)

model_id = "jmtzt/ijepa_vith16_1k"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

@torch.no_grad()
def infer(image):
    # Preprocess the image and mean-pool the patch embeddings
    # to obtain a single feature vector per image.
    inputs = processor(image, return_tensors="pt")
    outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1)

embed_1 = infer(image_1)
embed_2 = infer(image_2)

similarity = cosine_similarity(embed_1, embed_2)
print(similarity)
```
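The printed value is a single-element tensor containing the cosine similarity of the two pooled embeddings; values closer to 1 indicate that the model represents the two images as more similar.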
📚 Documentation
How does it work?
As opposed to generative methods that have a pixel decoder, I-JEPA has a predictor that makes predictions in latent space.
The predictor in I-JEPA can be seen as a primitive (and restricted) world model that is able to model spatial uncertainty in a static image from a partially observable context.
This world model is semantic in the sense that it predicts high-level information about unseen regions in the image, rather than pixel-level details.
We trained a stochastic decoder that maps the I-JEPA predicted representations back to pixel space as sketches.
The model correctly captures positional uncertainty and produces high - level object parts with the correct pose (e.g., dog’s head, wolf’s front legs).
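To make the latent-space prediction idea concrete, here is a minimal, illustrative sketch of the training objective: a context encoder embeds the visible patches, a predictor is asked for the latents of masked target patches, and the loss is computed against a target encoder in latent space rather than in pixel space. All module sizes, helper names (`make_encoder`, `ijepa_loss`), and the smooth-L1 loss choice are assumptions for illustration, not the released training code.

```python
import copy
import torch
import torch.nn as nn

embed_dim, num_patches = 1280, 256  # illustrative ViT-H-like sizes

def make_encoder(num_layers):
    layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=16, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=num_layers)

context_encoder = make_encoder(2)                # toy depth; the real encoder is much deeper
predictor = make_encoder(1)
target_encoder = copy.deepcopy(context_encoder)  # in practice an EMA copy of the context encoder

pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))

def ijepa_loss(patch_tokens, context_idx, target_idx):
    """patch_tokens: (batch, num_patches, embed_dim) patchified image tokens."""
    b = patch_tokens.size(0)
    # 1) Encode only the visible (context) patches.
    ctx = context_encoder(patch_tokens[:, context_idx] + pos_embed[:, context_idx])
    # 2) Predict the latents of the masked target patches, conditioning the
    #    predictor on mask tokens placed at the target positions.
    queries = mask_token.expand(b, len(target_idx), -1) + pos_embed[:, target_idx]
    pred = predictor(torch.cat([ctx, queries], dim=1))[:, -len(target_idx):]
    # 3) Targets are latent representations from the target encoder
    #    (stop-gradient), never raw pixels.
    with torch.no_grad():
        tgt = target_encoder(patch_tokens + pos_embed)[:, target_idx]
    # The objective is a regression in latent space.
    return nn.functional.smooth_l1_loss(pred, tgt)

# Toy example with random tokens: first 192 patches as context, last 64 as targets.
tokens = torch.randn(2, num_patches, embed_dim)
loss = ijepa_loss(tokens, torch.arange(0, 192), torch.arange(192, 256))
```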

Intended uses & limitations
I-JEPA can be used for image classification or feature extraction. This specific checkpoint is intended for feature extraction.
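For classification, the frozen embeddings can be fed to a lightweight classifier. Below is a minimal sketch of a linear probe; it reuses the `infer` helper (and the loaded `model`/`processor`) from the usage example above, `fit_linear_probe` and `classify` are hypothetical helper names, and scikit-learn and NumPy are assumed extra dependencies.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_linear_probe(images, labels):
    """images: list of PIL images; labels: list of integer class ids."""
    # infer() (defined in the usage example above) runs without gradients,
    # so the pooled features can be converted to NumPy directly.
    feats = np.concatenate([infer(img).numpy() for img in images])
    probe = LogisticRegression(max_iter=1000)
    probe.fit(feats, np.asarray(labels))
    return probe

def classify(probe, image):
    # Frozen I-JEPA backbone + linear probe on the pooled features.
    return int(probe.predict(infer(image).numpy())[0])
```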
🔧 Technical Details
The I-JEPA method predicts image representations in a self-supervised way. It uses a predictor in latent space instead of a pixel decoder. A stochastic decoder can map the predicted latent representations back to pixel space as sketches, illustrating that the model captures high-level semantic information rather than pixel-level detail.
📄 License
The license for this project is cc-by-nc-4.0.
BibTeX entry and citation info
If you use I-JEPA or this code in your work, please cite:
```bibtex
@article{assran2023self,
  title={Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture},
  author={Assran, Mahmoud and Duval, Quentin and Misra, Ishan and Bojanowski, Piotr and Vincent, Pascal and Rabbat, Michael and LeCun, Yann and Ballas, Nicolas},
  journal={arXiv preprint arXiv:2301.08243},
  year={2023}
}
```
Information Table

| Property | Details |
|----------|---------|
| Model Type | I-JEPA Model (Huge, fine-tuned on IN1K) |
| Training Data | ILSVRC/imagenet-1k |
| Library Name | transformers |
| License | cc-by-nc-4.0 |