🚀 I-JEPA Model (Giant, fine-tuned on IN22K)
I-JEPA is a self-supervised learning method that predicts the representations of parts of an image from the representations of other parts of the same image. This approach avoids relying on pre-specified invariances to hand-crafted data transformations, which tend to be biased toward particular downstream tasks, and it does not require the model to fill in pixel-level details, which often leads to less semantically meaningful representations.

🚀 Quick Start
✨ Features
I-JEPA is a self-supervised learning method that predicts image representations without relying on pre-specified invariances to data transformations and without filling in pixel-level details. Its predictor operates in latent space, and the model can be used for image classification or feature extraction.
📦 Installation
The model is loaded through the 🤗 Transformers library; the usage example below additionally relies on `torch`, `Pillow`, and `requests` (e.g. `pip install transformers torch pillow requests`).
💻 Usage Examples
Basic Usage
Here is how to use this model for image feature extraction:
```python
import requests
import torch
from PIL import Image
from torch.nn.functional import cosine_similarity
from transformers import AutoModel, AutoProcessor

# Two sample images from the COCO validation set
url_1 = "http://images.cocodataset.org/val2017/000000039769.jpg"
url_2 = "http://images.cocodataset.org/val2017/000000219578.jpg"
image_1 = Image.open(requests.get(url_1, stream=True).raw)
image_2 = Image.open(requests.get(url_2, stream=True).raw)

model_id = "jmtzt/ijepa_vitg16_22k"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

@torch.no_grad()
def infer(image):
    # Mean-pool the patch token representations into a single image embedding
    inputs = processor(image, return_tensors="pt")
    outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1)

embed_1 = infer(image_1)
embed_2 = infer(image_2)

similarity = cosine_similarity(embed_1, embed_2)
print(similarity)
```
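The script prints a one-element tensor holding the cosine similarity between the two mean-pooled embeddings; values closer to 1 indicate that the model considers the two images more semantically similar.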
📚 Documentation
How does it work?
As opposed to generative methods that have a pixel decoder, I-JEPA has a predictor that makes predictions in latent space. The predictor can be seen as a primitive (and restricted) world model that is able to model spatial uncertainty in a static image from a partially observable context. This world model is semantic in the sense that it predicts high-level information about unseen regions of the image rather than pixel-level details.
We trained a stochastic decoder that maps the I-JEPA predicted representations back into pixel space as sketches. The model correctly captures positional uncertainty and produces high-level object parts with the correct pose (e.g., a dog's head or a wolf's front legs).
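To make the latent-space prediction idea concrete, here is a minimal, heavily simplified PyTorch sketch of the objective. The linear "encoders", the fixed block mask, and the pooled-context predictor are toy stand-ins chosen for illustration; this is not the released I-JEPA training code.

```python
import torch
import torch.nn.functional as F

# Toy sketch of the I-JEPA objective: a context encoder embeds the visible patches,
# a predictor guesses the target encoder's representations of the masked patches,
# and the loss is computed purely in representation space (no pixel decoder).
dim = 64                                        # toy embedding size
context_encoder = torch.nn.Linear(dim, dim)     # stand-in for the ViT context encoder
target_encoder = torch.nn.Linear(dim, dim)      # stand-in for the EMA target encoder
predictor = torch.nn.Linear(dim, dim)           # stand-in for the narrow ViT predictor

patches = torch.randn(8, 16, dim)               # (batch, num_patches, dim) toy patch embeddings
context_mask = torch.zeros(16, dtype=torch.bool)
context_mask[:12] = True                        # first 12 patches are the visible context
target_mask = ~context_mask                     # predict the held-out block of 4 patches

with torch.no_grad():                           # targets come from the EMA encoder; no gradient
    targets = target_encoder(patches)[:, target_mask]

context = context_encoder(patches[:, context_mask])
# The real predictor is conditioned on positional tokens for the target block;
# here we simply pool the context and predict every target patch from it.
pred = predictor(context.mean(dim=1, keepdim=True)).expand_as(targets)

loss = F.mse_loss(pred, targets)                # distance measured in latent space
loss.backward()
```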

Intended uses & limitations
I-JEPA can be used for image classification or feature extraction. This specific checkpoint is intended for feature extraction.
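For classification, one common recipe is to freeze the backbone and train a lightweight head on the pooled features (linear probing). The sketch below illustrates this with the present checkpoint; the 10-class head, the optimizer settings, and the commented data-loading step are placeholder assumptions, not part of the official model card.

```python
import torch
from transformers import AutoModel, AutoProcessor

# Sketch: use the frozen I-JEPA backbone as a feature extractor and train a small
# linear head on top (linear probing). The 10-class head and the commented-out
# batch are placeholders for your own labelled dataset.
model_id = "jmtzt/ijepa_vitg16_22k"
processor = AutoProcessor.from_pretrained(model_id)
backbone = AutoModel.from_pretrained(model_id)
backbone.eval()                                   # keep the pretrained encoder frozen

num_classes = 10                                  # placeholder class count
head = torch.nn.Linear(backbone.config.hidden_size, num_classes)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

def extract_features(images):
    inputs = processor(images, return_tensors="pt")
    with torch.no_grad():
        outputs = backbone(**inputs)
    return outputs.last_hidden_state.mean(dim=1)  # mean-pool patch tokens

# One illustrative training step on a batch of (PIL images, labels):
# images, labels = next(iter(dataloader))         # supplied by your own data pipeline
# loss = torch.nn.functional.cross_entropy(head(extract_features(images)), labels)
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```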
📄 License
This model is released under the cc-by-nc-4.0 license.
BibTeX entry and citation info
If you use I-JEPA or this code in your work, please cite:
```bibtex
@article{assran2023self,
  title={Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture},
  author={Assran, Mahmoud and Duval, Quentin and Misra, Ishan and Bojanowski, Piotr and Vincent, Pascal and Rabbat, Michael and LeCun, Yann and Ballas, Nicolas},
  journal={arXiv preprint arXiv:2301.08243},
  year={2023}
}
```
| Property | Details |
|---|---|
| Model Type | I-JEPA Model (Giant, fine-tuned on IN22K) |
| Training Data | timm/imagenet-22k-wds |
| Library Name | transformers |
| License | cc-by-nc-4.0 |