🚀 V-JEPA 2
A cutting-edge video understanding model developed by FAIR, Meta, that extends the pretraining objectives of V-JEPA. It achieves state-of-the-art video understanding by scaling up training data and model size. The code is released in this repository.
🚀 Quick Start
V-JEPA 2 is a frontier video understanding model that can be used for video classification, retrieval, or as a video encoder for VLMs.
✨ Features
- Extends the pretraining objectives of V-JEPA.
- Achieves state-of-the-art video understanding capabilities.
- Leverages large-scale data and models.
📦 Installation
To run the V-JEPA 2 model, make sure you have the latest version of transformers installed:

```bash
pip install -U git+https://github.com/huggingface/transformers
```
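You can sanity-check the install with a quick version check. V-JEPA 2 support requires a recent build, so a source install should report a development version string:

```python
import transformers

# A source install from GitHub reports a dev version string;
# any recent release with V-JEPA 2 support also works.
print(transformers.__version__)
```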
💻 Usage Examples
Basic Usage
To use V-JEPA 2 for video and image processing, first load the model and the processor:
```python
from transformers import AutoVideoProcessor, AutoModel

hf_repo = "facebook/vjepa2-vitl-fpc64-256"

model = AutoModel.from_pretrained(hf_repo)
processor = AutoVideoProcessor.from_pretrained(hf_repo)
```
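If a GPU is available, you can move the encoder there and switch to eval mode before running inference. This is a minimal sketch using standard PyTorch calls; the `frames_per_clip` attribute is an assumption about the config schema:

```python
import torch

# Run on GPU when available; the model also works on CPU, just slower.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()

# The checkpoint name encodes the expected clip length (fpc64 = 64 frames
# per clip); this assumes the config exposes it as `frames_per_clip`.
print(model.config.frames_per_clip)
```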
Advanced Usage
Loading a Video
To load a video, sample as many frames as the model expects per clip. For this checkpoint (fpc64), that is 64 frames.
```python
import torch
from torchcodec.decoders import VideoDecoder
import numpy as np

video_url = "https://huggingface.co/datasets/nateraw/kinetics-mini/resolve/main/val/archery/-Qz25rXdMjE_000014_000024.mp4"

# Decode the first 64 frames of the clip.
vr = VideoDecoder(video_url)
frame_idx = np.arange(0, 64)
video = vr.get_frames_at(indices=frame_idx).data  # (frames, channels, height, width)

# Preprocess and encode without tracking gradients.
video = processor(video, return_tensors="pt").to(model.device)
with torch.no_grad():
    video_embeddings = model.get_vision_features(**video)

print(video_embeddings.shape)  # (batch, num_patch_tokens, hidden_dim)
```
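The encoder returns one embedding per spatio-temporal patch. For retrieval-style use, a common recipe is to pool these tokens into a single clip-level vector and compare clips by cosine similarity. The mean-pooling below is an assumption for illustration, not part of the released API:

```python
import torch.nn.functional as F

# Mean-pool the patch tokens into one vector per clip, then L2-normalize
# so that a plain dot product equals cosine similarity.
clip_vec = F.normalize(video_embeddings.mean(dim=1), dim=-1)
print(clip_vec.shape)  # (batch, hidden_dim)

# Given a second clip encoded the same way (hypothetical `other_vec`):
# similarity = clip_vec @ other_vec.T
```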
Loading an Image
To load an image, replicate it along the temporal dimension so it forms a static video clip of the desired number of frames.
```python
import torch
from transformers.image_utils import load_image

image = load_image("https://huggingface.co/datasets/merve/coco/resolve/main/val2017/000000000285.jpg")

# The video processor returns a single-frame clip; tile it along the
# temporal dimension (here to 16 frames) to mimic a static video.
pixel_values = processor(image, return_tensors="pt").to(model.device)["pixel_values_videos"]
pixel_values = pixel_values.repeat(1, 16, 1, 1, 1)

with torch.no_grad():
    image_embeddings = model.get_vision_features(pixel_values)

print(image_embeddings.shape)  # (batch, num_patch_tokens, hidden_dim)
```
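Beyond raw embeddings, transformers also exposes V-JEPA 2 with a video-classification head. The sketch below assumes an attentive-probe checkpoint fine-tuned on Something-Something-v2 is available on the Hub under the repo id shown; treat both the repo id and the 16-frame clip length as assumptions and adjust them to the checkpoint you use:

```python
import numpy as np
import torch
from torchcodec.decoders import VideoDecoder
from transformers import AutoModelForVideoClassification, AutoVideoProcessor

# Assumed classification checkpoint (fpc16 = 16 frames per clip).
clf_repo = "facebook/vjepa2-vitl-fpc16-256-ssv2"
clf_model = AutoModelForVideoClassification.from_pretrained(clf_repo)
clf_processor = AutoVideoProcessor.from_pretrained(clf_repo)

video_url = "https://huggingface.co/datasets/nateraw/kinetics-mini/resolve/main/val/archery/-Qz25rXdMjE_000014_000024.mp4"
frames = VideoDecoder(video_url).get_frames_at(indices=np.arange(0, 16)).data

inputs = clf_processor(frames, return_tensors="pt")
with torch.no_grad():
    logits = clf_model(**inputs).logits

pred = logits.argmax(-1).item()
print(clf_model.config.id2label[pred])
```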
For more code examples, please refer to the V-JEPA 2 documentation.
📄 License
This project is licensed under the MIT license.
Citation
```bibtex
@techreport{assran2025vjepa2,
  title={V-JEPA~2: Self-Supervised Video Models Enable Understanding, Prediction and Planning},
  author={Assran, Mahmoud and Bardes, Adrien and Fan, David and Garrido, Quentin and Howes, Russell and
          Komeili, Mojtaba and Muckley, Matthew and Rizvi, Ammar and Roberts, Claire and Sinha, Koustuv and Zholus, Artem and
          Arnaud, Sergio and Gejji, Abha and Martin, Ada and Robert Hogan, Francois and Dugas, Daniel and
          Bojanowski, Piotr and Khalidov, Vasil and Labatut, Patrick and Massa, Francisco and Szafraniec, Marc and
          Krishnakumar, Kapil and Li, Yong and Ma, Xiaodong and Chandar, Sarath and Meier, Franziska and LeCun, Yann and
          Rabbat, Michael and Ballas, Nicolas},
  institution={FAIR at Meta},
  year={2025}
}
```