I-JEPA Model (Huge, pre-trained on IN22K)
I-JEPA is a self-supervised learning method. It predicts the representations of parts of an image from those of other parts of the same image, without relying on pre-specified invariances to data transformations and without filling in pixel-level details, thus learning more semantically meaningful representations.

Quick Start
I-JEPA is a self-supervised learning method that predicts the representations of unseen image regions from a visible context, working directly in latent space and avoiding both hand-crafted data augmentations and pixel-level reconstruction.

Features
- Self-supervised learning: Predicts image representations without relying on pre-specified invariances or pixel-level reconstruction.
- Semantic prediction: Focuses on high-level semantic information rather than pixel-level details.
- Applicable to multiple tasks: Can be used for image classification and feature extraction.
Documentation
How does it work?
Unlike generative methods that rely on a pixel-level decoder, I-JEPA uses a predictor that operates in latent space. This predictor can be regarded as a primitive world-model that, given a partial context, handles spatial uncertainty in a static image and predicts high-level information about the unseen regions rather than their pixel values.
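To make this latent-space objective concrete, here is a minimal, self-contained sketch. It is not the actual I-JEPA implementation: the ViT context encoder, the EMA target encoder, the predictor with its mask tokens, and the multi-block masking strategy are all replaced here by toy linear layers and hardcoded block indices, so it only illustrates the shape of the training signal.

```python
# Minimal sketch of the latent-space prediction objective (assumption: toy
# linear modules and hardcoded block indices, not the real I-JEPA modules).
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 64
context_encoder = nn.Linear(dim, dim)   # stands in for the context ViT encoder
target_encoder = nn.Linear(dim, dim)    # stands in for the EMA target encoder
predictor = nn.Linear(dim, dim)         # predicts target representations in latent space

patches = torch.randn(1, 16, dim)       # a toy "image" as 16 patch embeddings
context_idx = [0, 1, 2, 3]              # visible context block
target_idx = [10, 11]                   # masked target block to be predicted

# Encode the visible context; encode the target block without gradients
# (in I-JEPA the target encoder is an exponential moving average of the context encoder).
ctx = context_encoder(patches[:, context_idx])
with torch.no_grad():
    tgt = target_encoder(patches[:, target_idx])

# Predict the target-block representations from the context and compare them
# directly in latent space: no pixels are ever reconstructed.
pred = predictor(ctx.mean(dim=1, keepdim=True)).expand_as(tgt)
loss = F.mse_loss(pred, tgt)
loss.backward()
print(loss.item())
```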
We trained a stochastic decoder to map the predicted representations back to pixel space as sketches. The resulting sketches show that the model captures positional uncertainty and generates high-level object parts in the correct pose.

Intended uses & limitations
I-JEPA can be used for image classification or feature extraction. This specific checkpoint is intended for feature extraction.
Usage Examples
Basic Usage
Here is how to use this model for image feature extraction:
```python
import requests
import torch
from PIL import Image
from torch.nn.functional import cosine_similarity
from transformers import AutoModel, AutoProcessor

url_1 = "http://images.cocodataset.org/val2017/000000039769.jpg"
url_2 = "http://images.cocodataset.org/val2017/000000219578.jpg"

image_1 = Image.open(requests.get(url_1, stream=True).raw)
image_2 = Image.open(requests.get(url_2, stream=True).raw)

model_id = "jmtzt/ijepa_vith14_22k"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)


@torch.no_grad()
def infer(image):
    """Return one embedding per image by mean-pooling the patch representations."""
    inputs = processor(image, return_tensors="pt")
    outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1)


embed_1 = infer(image_1)
embed_2 = infer(image_2)

similarity = cosine_similarity(embed_1, embed_2)
print(similarity)
```
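Beyond pairwise similarity, the same pooled embeddings can feed a downstream classifier, which is how the image-classification use mentioned under intended uses would typically be realized. The sketch below is a minimal linear probe under stated assumptions: `train_images`, `train_labels`, the training hyperparameters, and the reuse of `infer` from the example above are illustrative placeholders, not part of this checkpoint or its API.

```python
import torch
import torch.nn as nn

# Hypothetical labelled data: `train_images` and `train_labels` are placeholders
# for your own dataset; `infer` is the function defined in the example above.
train_embeds = torch.cat([infer(img) for img in train_images])  # (N, hidden_size)
labels = torch.tensor(train_labels)                             # (N,)

num_classes = int(labels.max()) + 1
probe = nn.Linear(train_embeds.shape[-1], num_classes)  # linear head on frozen features
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Train only the linear head; the I-JEPA backbone stays frozen.
for _ in range(100):
    optimizer.zero_grad()
    loss = criterion(probe(train_embeds), labels)
    loss.backward()
    optimizer.step()

# Classify a new image with the frozen backbone plus the trained head.
prediction = probe(infer(image_1)).argmax(dim=-1)
print(prediction)
```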
BibTeX entry and citation info
If you use I-JEPA or this code in your work, please cite:
@article{assran2023self,
  title={Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture},
  author={Assran, Mahmoud and Duval, Quentin and Misra, Ishan and Bojanowski, Piotr and Vincent, Pascal and Rabbat, Michael and LeCun, Yann and Ballas, Nicolas},
  journal={arXiv preprint arXiv:2301.08243},
  year={2023}
}
License
This project is licensed under the cc-by-nc-4.0 license.
Information Table

| Property | Details |
|----------|---------|
| Datasets | timm/imagenet-22k-wds |
| Library Name | transformers |
| License | cc-by-nc-4.0 |