🚀 V-JEPA 2
A cutting-edge video understanding model developed by FAIR, Meta, that extends the pretraining objectives of V-JEPA. It achieves state-of-the-art video understanding by scaling up training data and model size. The code is released in this repository.
🚀 Quick Start
V-JEPA 2 is a frontier video understanding model that can be used for video classification, retrieval, or as a video encoder for VLMs.
✨ Features
- Extends the pretraining objectives of V-JEPA.
- Achieves state-of-the-art video understanding capabilities.
- Leverages large-scale data and models.
📦 Installation
To run the V-JEPA 2 model, make sure you have the latest version of transformers installed:

```bash
pip install -U git+https://github.com/huggingface/transformers
```
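You can sanity-check the install with a quick version check. V-JEPA 2 support requires a recent build, so a source install should report a development version string:

```python
import transformers

# A source install from GitHub reports a dev version string;
# any recent release with V-JEPA 2 support also works.
print(transformers.__version__)
```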
💻 Usage Examples
Basic Usage
To use V-JEPA 2 for video and image processing, first load the model and the processor:
```python
from transformers import AutoVideoProcessor, AutoModel

hf_repo = "facebook/vjepa2-vitl-fpc64-256"

model = AutoModel.from_pretrained(hf_repo)
processor = AutoVideoProcessor.from_pretrained(hf_repo)
```
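If a GPU is available, you can move the encoder there and switch to eval mode before running inference. This is a minimal sketch using standard PyTorch calls; the `frames_per_clip` attribute is an assumption about the config schema:

```python
import torch

# Run on GPU when available; the model also works on CPU, just slower.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()

# The checkpoint name encodes the expected clip length (fpc64 = 64 frames
# per clip); this assumes the config exposes it as `frames_per_clip`.
print(model.config.frames_per_clip)
```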
Advanced Usage
Loading a Video
To load a video, sample as many frames as the model expects per clip. For this checkpoint (fpc64), that is 64 frames.
```python
import torch
from torchcodec.decoders import VideoDecoder
import numpy as np

video_url = "https://huggingface.co/datasets/nateraw/kinetics-mini/resolve/main/val/archery/-Qz25rXdMjE_000014_000024.mp4"

# Decode the first 64 frames of the clip.
vr = VideoDecoder(video_url)
frame_idx = np.arange(0, 64)
video = vr.get_frames_at(indices=frame_idx).data  # (frames, channels, height, width)

# Preprocess and encode without tracking gradients.
video = processor(video, return_tensors="pt").to(model.device)
with torch.no_grad():
    video_embeddings = model.get_vision_features(**video)

print(video_embeddings.shape)  # (batch, num_patch_tokens, hidden_dim)
```
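The encoder returns one embedding per spatio-temporal patch. For retrieval-style use, a common recipe is to pool these tokens into a single clip-level vector and compare clips by cosine similarity. The mean-pooling below is an assumption for illustration, not part of the released API:

```python
import torch.nn.functional as F

# Mean-pool the patch tokens into one vector per clip, then L2-normalize
# so that a plain dot product equals cosine similarity.
clip_vec = F.normalize(video_embeddings.mean(dim=1), dim=-1)
print(clip_vec.shape)  # (batch, hidden_dim)

# Given a second clip encoded the same way (hypothetical `other_vec`):
# similarity = clip_vec @ other_vec.T
```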
Loading an Image
To load an image, replicate it along the temporal dimension so it forms a static video clip of the desired number of frames.
```python
import torch
from transformers.image_utils import load_image

image = load_image("https://huggingface.co/datasets/merve/coco/resolve/main/val2017/000000000285.jpg")

# The video processor returns a single-frame clip; tile it along the
# temporal dimension (here to 16 frames) to mimic a static video.
pixel_values = processor(image, return_tensors="pt").to(model.device)["pixel_values_videos"]
pixel_values = pixel_values.repeat(1, 16, 1, 1, 1)

with torch.no_grad():
    image_embeddings = model.get_vision_features(pixel_values)

print(image_embeddings.shape)  # (batch, num_patch_tokens, hidden_dim)
```
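Beyond raw embeddings, transformers also exposes V-JEPA 2 with a video-classification head. The sketch below assumes an attentive-probe checkpoint fine-tuned on Something-Something-v2 is available on the Hub under the repo id shown; treat both the repo id and the 16-frame clip length as assumptions and adjust them to the checkpoint you use:

```python
import numpy as np
import torch
from torchcodec.decoders import VideoDecoder
from transformers import AutoModelForVideoClassification, AutoVideoProcessor

# Assumed classification checkpoint (fpc16 = 16 frames per clip).
clf_repo = "facebook/vjepa2-vitl-fpc16-256-ssv2"
clf_model = AutoModelForVideoClassification.from_pretrained(clf_repo)
clf_processor = AutoVideoProcessor.from_pretrained(clf_repo)

video_url = "https://huggingface.co/datasets/nateraw/kinetics-mini/resolve/main/val/archery/-Qz25rXdMjE_000014_000024.mp4"
frames = VideoDecoder(video_url).get_frames_at(indices=np.arange(0, 16)).data

inputs = clf_processor(frames, return_tensors="pt")
with torch.no_grad():
    logits = clf_model(**inputs).logits

pred = logits.argmax(-1).item()
print(clf_model.config.id2label[pred])
```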
For more code examples, please refer to the V-JEPA 2 documentation.
📄 License
This project is licensed under the MIT license.
Citation
```bibtex
@techreport{assran2025vjepa2,
  title={V-JEPA~2: Self-Supervised Video Models Enable Understanding, Prediction and Planning},
  author={Assran, Mahmoud and Bardes, Adrien and Fan, David and Garrido, Quentin and Howes, Russell and
          Komeili, Mojtaba and Muckley, Matthew and Rizvi, Ammar and Roberts, Claire and Sinha, Koustuv and Zholus, Artem and
          Arnaud, Sergio and Gejji, Abha and Martin, Ada and Robert Hogan, Francois and Dugas, Daniel and
          Bojanowski, Piotr and Khalidov, Vasil and Labatut, Patrick and Massa, Francisco and Szafraniec, Marc and
          Krishnakumar, Kapil and Li, Yong and Ma, Xiaodong and Chandar, Sarath and Meier, Franziska and LeCun, Yann and
          Rabbat, Michael and Ballas, Nicolas},
  institution={FAIR at Meta},
  year={2025}
}
```