Open-Source V-JEPA 2 Video Understanding Model - State-of-the-Art Video Understanding, Free to Use


Vjepa2 Vitl Fpc64 256

Developed by facebook

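V-JEPA 2 is a frontier video understanding model developed by the FAIR team at Meta. It extends the VJEPA pretraining objectives and delivers state-of-the-art video understanding.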

Video Processing

Transformers

Open Source License: MIT #video-understanding #self-supervised-learning #multimodal-encoding

Downloads 109

Release Time : 5/31/2025

Model Overview

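V-JEPA 2 is a video understanding model that can be used for tasks such as video classification and retrieval, and can also serve as the video encoder of a vision-language model (VLM).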

Model Features

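Advanced video understanding

Extends the VJEPA pretraining objectives and delivers state-of-the-art video understanding.

Multimodal processing

Handles both video and image inputs.

Versatile applications

Supports tasks such as video classification and retrieval, and can serve as the video encoder of a vision-language model (VLM).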

Model Capabilities

Video understanding

Video classification

Video retrieval

Visual feature extraction

Use Cases

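Video analysis

Video classification

Classifies and recognizes video content.

Video retrieval

Retrieves similar videos based on their content.

Multimodal applications

Vision-language model encoder

Serves as the video encoder of a vision-language model.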

🚀 V-JEPA 2

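V-JEPA 2 is a frontier video understanding model developed by the FAIR team at Meta. It extends the VJEPA pretraining objectives and, by scaling up both data and model size, achieves state-of-the-art video understanding. The code is released in this repository.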

🚀 Quick Start

V-JEPA 2 can be used for tasks such as video classification and retrieval, and can also serve as the video encoder of a vision-language model (VLM).

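✨ Key Features

Extends the VJEPA pretraining objectives for advanced video understanding.
Handles both video and image inputs.
Supports tasks such as video classification and retrieval, and can serve as the video encoder of a VLM.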

📦 Installation

To run V-JEPA 2, make sure the latest version of the transformers library is installed:

pip install -U git+https://github.com/huggingface/transformers

💻 Usage Examples

Basic Usage

Load the model and processor

from transformers import AutoVideoProcessor, AutoModel

hf_repo = "facebook/vjepa2-vitl-fpc64-256"

model = AutoModel.from_pretrained(hf_repo)
processor = AutoVideoProcessor.from_pretrained(hf_repo)

Load a video

import torch
from torchcodec.decoders import VideoDecoder
import numpy as np

video_url = "https://huggingface.co/datasets/nateraw/kinetics-mini/resolve/main/val/archery/-Qz25rXdMjE_000014_000024.mp4"
vr = VideoDecoder(video_url)
frame_idx = np.arange(0, 64)  # take the first 64 frames; a more elaborate sampling strategy can be defined here
video = vr.get_frames_at(indices=frame_idx).data  # T x C x H x W
video = processor(video, return_tensors="pt").to(model.device)
with torch.no_grad():
    video_embeddings = model.get_vision_features(**video)

print(video_embeddings.shape)
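
The example above simply takes the first 64 frames. As one possible alternative, the sketch below spreads the sampled indices evenly over the whole clip. The helper name `sample_frame_indices` and its parameters are hypothetical and are not part of transformers or torchcodec; it only assumes that the total number of frames in the video is known.

```python
def sample_frame_indices(num_frames_in_video, num_samples=64):
    """Return `num_samples` frame indices spread evenly over the clip.

    Hypothetical helper for illustration only. If the video has fewer
    frames than `num_samples`, some indices are repeated.
    """
    if num_frames_in_video <= 0:
        raise ValueError("the video must contain at least one frame")
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")
    if num_samples == 1:
        return [0]
    last = num_frames_in_video - 1
    # Evenly spaced positions from the first to the last frame,
    # rounded to the nearest valid integer index.
    return [round(i * last / (num_samples - 1)) for i in range(num_samples)]


# Example: pick 64 indices from a 300-frame clip.
frame_idx = sample_frame_indices(300, num_samples=64)
print(len(frame_idx), frame_idx[0], frame_idx[-1])
```

The resulting list could then stand in for `frame_idx` in the call to `vr.get_frames_at(indices=frame_idx)`, with the frame count taken from the decoder rather than hard-coded as it is here.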

Load an image

from transformers.image_utils import load_image

image = load_image("https://huggingface.co/datasets/merve/coco/resolve/main/val2017/000000000285.jpg")
pixel_values = processor(image, return_tensors="pt").to(model.device)["pixel_values_videos"]
pixel_values = pixel_values.repeat(1, 16, 1, 1, 1)  # repeat the image 16 times along the frame axis

with torch.no_grad():
    image_embeddings = model.get_vision_features(pixel_values)

print(image_embeddings.shape)
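
Once embeddings are available, content-based retrieval can be approximated by pooling each clip's features into a single vector and ranking by cosine similarity. The following is a minimal sketch of that idea, not the method used by the V-JEPA 2 authors. It assumes the embeddings have shape (batch, tokens, dim) and uses random NumPy arrays as stand-ins for real model outputs; the function names are hypothetical.

```python
import numpy as np


def pool_embeddings(token_embeddings):
    """Mean-pool (batch, tokens, dim) features into (batch, dim) clip vectors."""
    return token_embeddings.mean(axis=1)


def cosine_similarity_matrix(queries, gallery, eps=1e-12):
    """Cosine similarity between every query row and every gallery row."""
    q = queries / np.maximum(np.linalg.norm(queries, axis=1, keepdims=True), eps)
    g = gallery / np.maximum(np.linalg.norm(gallery, axis=1, keepdims=True), eps)
    return q @ g.T


def retrieve_top_k(query, gallery, k=3):
    """Return the indices of the k most similar gallery vectors, best first."""
    scores = cosine_similarity_matrix(query[None, :], gallery)[0]
    return np.argsort(-scores)[:k], scores


# Stand-in data (assumption): 5 clips, 8 tokens each, 16-dimensional features.
rng = np.random.default_rng(0)
gallery_tokens = rng.normal(size=(5, 8, 16))
gallery = pool_embeddings(gallery_tokens)

# A query that is a slightly perturbed copy of clip 2.
query = gallery[2] + 0.01 * rng.normal(size=16)

top_idx, scores = retrieve_top_k(query, gallery, k=3)
print(top_idx)
```

With real model outputs, the tensors would first be moved to the CPU and converted to NumPy arrays. Mean pooling is only one choice; other pooling schemes may suit a given retrieval task better.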

For more code examples, see the V-JEPA 2 documentation.

📄 License

This project is released under the MIT license.

📚 Citation

@techreport{assran2025vjepa2,
  title={V-JEPA~2: Self-Supervised Video Models Enable Understanding, Prediction and Planning},
  author={Assran, Mahmoud and Bardes, Adrien and Fan, David and Garrido, Quentin and Howes, Russell and
Komeili, Mojtaba and Muckley, Matthew and Rizvi, Ammar and Roberts, Claire and Sinha, Koustuv and Zholus, Artem and
Arnaud, Sergio and Gejji, Abha and Martin, Ada and Robert Hogan, Francois and Dugas, Daniel and
Bojanowski, Piotr and Khalidov, Vasil and Labatut, Patrick and Massa, Francisco and Szafraniec, Marc and
Krishnakumar, Kapil and Li, Yong and Ma, Xiaodong and Chandar, Sarath and Meier, Franziska and LeCun, Yann and
Rabbat, Michael and Ballas, Nicolas},
  institution={FAIR at Meta},
  year={2025}
}