🚀 I-JEPA Model (Huge, fine-tuned on IN1K)
I-JEPA is a self-supervised learning method. It predicts the representations of part of an image from those of other parts of the same image. This approach has two key advantages:
- It doesn't rely on pre-specified invariances to hand-crafted data transformations, which can be biased for specific downstream tasks.
- It avoids having the model fill in pixel-level details, which often leads to learning less semantically meaningful representations.

🚀 Quick Start
✨ Features
- Self-supervised Learning: I-JEPA is a self-supervised learning method that predicts image representations without relying on pre-defined data transformation invariances or pixel-level filling.
- Latent Space Prediction: Instead of a pixel decoder like generative methods, I-JEPA has a predictor that makes predictions in latent space.
- Semantic World Model: The predictor in I-JEPA acts as a semantic world model, predicting high-level information about unseen image regions.
📦 Installation
Installation instructions are not included in the original model card; the usage example below assumes a working Hugging Face Transformers environment with `transformers`, `torch`, `Pillow`, and `requests` installed.
💻 Usage Examples
Basic Usage
Here is how to use this model for image feature extraction:
```python
import requests
import torch
from PIL import Image
from torch.nn.functional import cosine_similarity
from transformers import AutoModel, AutoProcessor

# Two sample images from the COCO validation set
url_1 = "http://images.cocodataset.org/val2017/000000039769.jpg"
url_2 = "http://images.cocodataset.org/val2017/000000219578.jpg"
image_1 = Image.open(requests.get(url_1, stream=True).raw)
image_2 = Image.open(requests.get(url_2, stream=True).raw)

model_id = "jmtzt/ijepa_vith16_1k"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

@torch.no_grad()
def infer(image):
    # Preprocess the image and mean-pool the patch embeddings
    # to obtain a single feature vector per image.
    inputs = processor(image, return_tensors="pt")
    outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1)

embed_1 = infer(image_1)
embed_2 = infer(image_2)

similarity = cosine_similarity(embed_1, embed_2)
print(similarity)
```
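The printed value is a single-element tensor containing the cosine similarity of the two pooled embeddings; values closer to 1 indicate that the model represents the two images as more similar.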
📚 Documentation
How does it work?
As opposed to generative methods that have a pixel decoder, I-JEPA has a predictor that makes predictions in latent space.
The predictor in I-JEPA can be seen as a primitive (and restricted) world model that is able to model spatial uncertainty in a static image from a partially observable context.
This world model is semantic in the sense that it predicts high-level information about unseen regions in the image, rather than pixel-level details.
We trained a stochastic decoder that maps the I-JEPA predicted representations back to pixel space as sketches.
The model correctly captures positional uncertainty and produces high - level object parts with the correct pose (e.g., dog’s head, wolf’s front legs).
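To make the latent-space prediction idea concrete, here is a minimal, illustrative sketch of the training objective: a context encoder embeds the visible patches, a predictor is asked for the latents of masked target patches, and the loss is computed against a target encoder in latent space rather than in pixel space. All module sizes, helper names (`make_encoder`, `ijepa_loss`), and the smooth-L1 loss choice are assumptions for illustration, not the released training code.

```python
import copy
import torch
import torch.nn as nn

embed_dim, num_patches = 1280, 256  # illustrative ViT-H-like sizes

def make_encoder(num_layers):
    layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=16, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=num_layers)

context_encoder = make_encoder(2)                # toy depth; the real encoder is much deeper
predictor = make_encoder(1)
target_encoder = copy.deepcopy(context_encoder)  # in practice an EMA copy of the context encoder

pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))

def ijepa_loss(patch_tokens, context_idx, target_idx):
    """patch_tokens: (batch, num_patches, embed_dim) patchified image tokens."""
    b = patch_tokens.size(0)
    # 1) Encode only the visible (context) patches.
    ctx = context_encoder(patch_tokens[:, context_idx] + pos_embed[:, context_idx])
    # 2) Predict the latents of the masked target patches, conditioning the
    #    predictor on mask tokens placed at the target positions.
    queries = mask_token.expand(b, len(target_idx), -1) + pos_embed[:, target_idx]
    pred = predictor(torch.cat([ctx, queries], dim=1))[:, -len(target_idx):]
    # 3) Targets are latent representations from the target encoder
    #    (stop-gradient), never raw pixels.
    with torch.no_grad():
        tgt = target_encoder(patch_tokens + pos_embed)[:, target_idx]
    # The objective is a regression in latent space.
    return nn.functional.smooth_l1_loss(pred, tgt)

# Toy example with random tokens: first 192 patches as context, last 64 as targets.
tokens = torch.randn(2, num_patches, embed_dim)
loss = ijepa_loss(tokens, torch.arange(0, 192), torch.arange(192, 256))
```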

Intended uses & limitations
I-JEPA can be used for image classification or feature extraction. This specific checkpoint is intended for feature extraction.
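For classification, the frozen embeddings can be fed to a lightweight classifier. Below is a minimal sketch of a linear probe; it reuses the `infer` helper (and the loaded `model`/`processor`) from the usage example above, `fit_linear_probe` and `classify` are hypothetical helper names, and scikit-learn and NumPy are assumed extra dependencies.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_linear_probe(images, labels):
    """images: list of PIL images; labels: list of integer class ids."""
    # infer() (defined in the usage example above) runs without gradients,
    # so the pooled features can be converted to NumPy directly.
    feats = np.concatenate([infer(img).numpy() for img in images])
    probe = LogisticRegression(max_iter=1000)
    probe.fit(feats, np.asarray(labels))
    return probe

def classify(probe, image):
    # Frozen I-JEPA backbone + linear probe on the pooled features.
    return int(probe.predict(infer(image).numpy())[0])
```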
🔧 Technical Details
The I-JEPA method predicts image representations in a self-supervised way. It uses a predictor in latent space instead of a pixel decoder. A stochastic decoder can map the predicted latent representations back to pixel space as sketches, illustrating that the model captures high-level semantic information rather than pixel-level detail.
📄 License
The license for this project is cc-by-nc-4.0.
BibTeX entry and citation info
If you use I-JEPA or this code in your work, please cite:
```bibtex
@article{assran2023self,
  title={Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture},
  author={Assran, Mahmoud and Duval, Quentin and Misra, Ishan and Bojanowski, Piotr and Vincent, Pascal and Rabbat, Michael and LeCun, Yann and Ballas, Nicolas},
  journal={arXiv preprint arXiv:2301.08243},
  year={2023}
}
```
Information Table

| Property | Details |
|----------|---------|
| Model Type | I-JEPA Model (Huge, fine-tuned on IN1K) |
| Training Data | ILSVRC/imagenet-1k |
| Library Name | transformers |
| License | cc-by-nc-4.0 |