🚀 I-JEPA Model (Huge, fine-tuned on IN1K)
I-JEPA is a self-supervised learning method: it predicts the representations of part of an image from those of other parts of the same image. This approach avoids relying on pre-specified invariances to hand-crafted data transformations, which may be biased toward specific downstream tasks. It also avoids having the model fill in pixel-level details, which often leads to less semantically meaningful representations.
🚀 Quick Start
Dataset and Library
| Property | Details |
|----------|---------|
| Datasets | ILSVRC/imagenet-1k |
| Library Name | transformers |
| License | cc-by-nc-4.0 |
Model Introduction
I-JEPA is a self-supervised learning method. At a high level, it predicts the representations of part of an image from the representations of other parts of the same image:
- without relying on pre-specified invariances to hand-crafted data transformations, which tend to be biased for particular downstream tasks,
- and without having the model fill in pixel-level details, which tend to result in learning less semantically meaningful representations.

✨ Features
Working Principle
As opposed to generative methods that have a pixel decoder, I-JEPA has a predictor that makes predictions in latent space. The predictor in I-JEPA can be seen as a primitive (and restricted) world model that is able to model spatial uncertainty in a static image from a partially observable context. This world model is semantic in the sense that it predicts high-level information about unseen regions in the image, rather than pixel-level details.
We trained a stochastic decoder that maps the I-JEPA predicted representations back to pixel space as sketches. The model correctly captures positional uncertainty and produces high-level object parts with the correct pose (e.g., a dog's head, a wolf's front legs).
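To make the latent-space prediction idea concrete, here is a minimal, runnable sketch of a JEPA-style training step. This is not the released training code: the linear stand-in modules, toy shapes, and pooled context are illustrative assumptions. In the actual method the encoders are Vision Transformers, the target encoder is an exponential moving average of the context encoder, and the predictor is conditioned on positional tokens for the masked block.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins so the sketch runs end to end. In I-JEPA proper, the encoders
# are ViTs and the target encoder is an EMA copy of the context encoder.
dim, n_patches, n_ctx = 64, 16, 12
context_encoder = nn.Linear(dim, dim)
target_encoder = nn.Linear(dim, dim)
target_encoder.load_state_dict(context_encoder.state_dict())
pos_embed = nn.Embedding(n_patches, dim)        # positions of the patches to predict
predictor = nn.Linear(2 * dim, dim)             # (context, target position) -> latent

patches = torch.randn(8, n_patches, dim)        # (batch, patches, dim), random toy data
ctx_idx = torch.arange(0, n_ctx)                # visible context patches
tgt_idx = torch.arange(n_ctx, n_patches)        # masked target block

# 1. Encode only the visible context patches (pooled here for simplicity).
ctx = context_encoder(patches[:, ctx_idx]).mean(dim=1)          # (B, dim)

# 2. Compute target representations in latent space, without gradients.
with torch.no_grad():
    tgt = target_encoder(patches)[:, tgt_idx]                   # (B, |tgt|, dim)

# 3. Predict each masked patch's *representation* from context + its position.
pos = pos_embed(tgt_idx).unsqueeze(0).expand(patches.size(0), -1, -1)
pred = predictor(torch.cat([ctx.unsqueeze(1).expand_as(pos), pos], dim=-1))

# 4. The loss is computed entirely in latent space; no pixels are reconstructed.
loss = F.mse_loss(pred, tgt)
loss.backward()
print(loss.item())
```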

Intended Uses & Limitations
I-JEPA can be used for image classification or feature extraction. This specific checkpoint is intended for feature extraction.
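If you want classification rather than raw features, one common, lightweight recipe is a linear probe: freeze the model, extract pooled embeddings, and fit a linear classifier on top. The sketch below assumes hypothetical `train_images`, `train_labels`, and `test_images` variables (lists of PIL images and integer labels) and uses scikit-learn for the probe.

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoProcessor

model_id = "jmtzt/ijepa_vith14_1k"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).eval()

@torch.no_grad()
def embed(images):
    # Mean-pool the patch-token representations into one embedding per image.
    inputs = processor(images, return_tensors="pt")
    return model(**inputs).last_hidden_state.mean(dim=1)

# `train_images`, `train_labels`, and `test_images` are placeholders for your data.
X_train = embed(train_images).numpy()
clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
predictions = clf.predict(embed(test_images).numpy())
```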
💻 Usage Examples
Basic Usage
Here is how to use this model for image feature extraction:
```python
import requests
import torch
from PIL import Image
from torch.nn.functional import cosine_similarity
from transformers import AutoModel, AutoProcessor

url_1 = "http://images.cocodataset.org/val2017/000000039769.jpg"
url_2 = "http://images.cocodataset.org/val2017/000000219578.jpg"

image_1 = Image.open(requests.get(url_1, stream=True).raw)
image_2 = Image.open(requests.get(url_2, stream=True).raw)

model_id = "jmtzt/ijepa_vith14_1k"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

@torch.no_grad()
def infer(image):
    # Preprocess the image, run the model, and mean-pool the patch-token
    # representations into a single image-level embedding.
    inputs = processor(image, return_tensors="pt")
    outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1)

embed_1 = infer(image_1)
embed_2 = infer(image_2)

similarity = cosine_similarity(embed_1, embed_2)
print(similarity)
```
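A note on the pooling choice: the model returns one hidden state per image patch rather than a dedicated [CLS] token, so the example mean-pools `last_hidden_state` to obtain a single embedding per image. Other strategies (e.g., keeping the per-patch tokens for dense tasks) are equally valid depending on your use case.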
📚 Documentation
BibTeX entry and citation info
If you use I-JEPA or this code in your work, please cite:
```bibtex
@article{assran2023self,
  title={Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture},
  author={Assran, Mahmoud and Duval, Quentin and Misra, Ishan and Bojanowski, Piotr and Vincent, Pascal and Rabbat, Michael and LeCun, Yann and Ballas, Nicolas},
  journal={arXiv preprint arXiv:2301.08243},
  year={2023}
}
```