🚀 V-JEPA 2
V-JEPA 2 is a cutting-edge video understanding model developed by Meta's FAIR team. It extends the pretraining objectives of VJEPA and, by scaling up data and model size, achieves state-of-the-art video understanding. The code has been released in this repository.
🚀 Quick Start
V-JEPA 2 is a powerful video understanding model that can be used for tasks such as video classification and retrieval, and it can also serve as a video encoder for vision-language models (VLMs).
✨ Key Features
- Extends the pretraining objectives of VJEPA, delivering state-of-the-art video understanding.
- Handles both video and image data.
- Supports tasks such as video classification and retrieval, and can serve as a video encoder for VLMs.
📦 Installation
To run the V-JEPA 2 model, make sure you have the latest version of the transformers library installed:
pip install -U git+https://github.com/huggingface/transformers
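To confirm that the installed build exposes the video classes used below, a quick sanity check (a minimal sketch, nothing V-JEPA 2 specific beyond the imports) is:

import transformers
from transformers import AutoModel, AutoVideoProcessor  # AutoVideoProcessor only exists in recent releases

# Print the installed version; an ImportError above means transformers is too old for V-JEPA 2.
print(transformers.__version__)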
💻 Usage Examples
Basic Usage
Load the model and processor
from transformers import AutoVideoProcessor, AutoModel
hf_repo = "facebook/vjepa2-vitl-fpc64-256"
model = AutoModel.from_pretrained(hf_repo)
processor = AutoVideoProcessor.from_pretrained(hf_repo)
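If a GPU is available, the model can also be loaded in half precision and moved to the device. The snippet below is a sketch that uses only standard from_pretrained arguments (torch_dtype); adjust it to your hardware.

import torch
from transformers import AutoModel, AutoVideoProcessor

hf_repo = "facebook/vjepa2-vitl-fpc64-256"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Half precision roughly halves memory use; assumes the GPU has enough memory for the ViT-L backbone.
model = AutoModel.from_pretrained(hf_repo, torch_dtype=torch.float16).to(device)
processor = AutoVideoProcessor.from_pretrained(hf_repo)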
Load a video
import torch
from torchcodec.decoders import VideoDecoder
import numpy as np
video_url = "https://huggingface.co/datasets/nateraw/kinetics-mini/resolve/main/val/archery/-Qz25rXdMjE_000014_000024.mp4"
vr = VideoDecoder(video_url)
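# Sample the first 64 frames; the fpc64 checkpoint processes 64 frames per clip.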
frame_idx = np.arange(0, 64)
video = vr.get_frames_at(indices=frame_idx).data
video = processor(video, return_tensors="pt").to(model.device)
with torch.no_grad():
    video_embeddings = model.get_vision_features(**video)
print(video_embeddings.shape)
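V-JEPA 2 also comes with fine-tuned classification heads. The sketch below assumes the VJEPA2ForVideoClassification class and a Something-Something-v2 checkpoint named facebook/vjepa2-vitl-fpc16-256-ssv2; check the Hugging Face Hub for the checkpoints that are actually published.

import torch
import numpy as np
from torchcodec.decoders import VideoDecoder
from transformers import AutoVideoProcessor, VJEPA2ForVideoClassification

# Assumed fine-tuned checkpoint id; verify the exact name on the Hugging Face Hub.
clf_repo = "facebook/vjepa2-vitl-fpc16-256-ssv2"
clf_model = VJEPA2ForVideoClassification.from_pretrained(clf_repo)
clf_processor = AutoVideoProcessor.from_pretrained(clf_repo)

# Sample 16 frames (this checkpoint's assumed frames-per-clip) from the same archery clip as above.
video_url = "https://huggingface.co/datasets/nateraw/kinetics-mini/resolve/main/val/archery/-Qz25rXdMjE_000014_000024.mp4"
frames = VideoDecoder(video_url).get_frames_at(indices=np.arange(0, 16)).data

inputs = clf_processor(frames, return_tensors="pt").to(clf_model.device)
with torch.no_grad():
    logits = clf_model(**inputs).logits
print(clf_model.config.id2label[logits.argmax(-1).item()])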
Load an image
from transformers.image_utils import load_image
image = load_image("https://huggingface.co/datasets/merve/coco/resolve/main/val2017/000000000285.jpg")
pixel_values = processor(image, return_tensors="pt").to(model.device)["pixel_values_videos"]
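# Repeat the single image along the time dimension to build a 16-frame clip, matching the video input format.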
pixel_values = pixel_values.repeat(1, 16, 1, 1, 1)
with torch.no_grad():
    image_embeddings = model.get_vision_features(pixel_values)
print(image_embeddings.shape)
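Both calls return patch-level features of shape (batch, tokens, hidden), so a simple way to use them for retrieval is to pool the tokens into a single vector and compare embeddings with cosine similarity. This is a minimal sketch on top of the tensors computed above; mean pooling is an assumption, not a prescribed V-JEPA 2 recipe.

import torch.nn.functional as F

# Mean-pool patch tokens into one vector per clip / image (assumed pooling strategy).
video_vec = video_embeddings.mean(dim=1)
image_vec = image_embeddings.mean(dim=1)

# Cosine similarity as a retrieval score between the video and the image.
print(F.cosine_similarity(video_vec, image_vec))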
For more code examples, see the V-JEPA 2 documentation.
📄 License
This project is released under the MIT License.
📚 Citation
@techreport{assran2025vjepa2,
title={V-JEPA~2: Self-Supervised Video Models Enable Understanding, Prediction and Planning},
author={Assran, Mahmoud and Bardes, Adrien and Fan, David and Garrido, Quentin and Howes, Russell and
Komeili, Mojtaba and Muckley, Matthew and Rizvi, Ammar and Roberts, Claire and Sinha, Koustuv and Zholus, Artem and
Arnaud, Sergio and Gejji, Abha and Martin, Ada and Robert Hogan, Francois and Dugas, Daniel and
Bojanowski, Piotr and Khalidov, Vasil and Labatut, Patrick and Massa, Francisco and Szafraniec, Marc and
Krishnakumar, Kapil and Li, Yong and Ma, Xiaodong and Chandar, Sarath and Meier, Franziska and LeCun, Yann and
Rabbat, Michael and Ballas, Nicolas},
institution={FAIR at Meta},
year={2025}
}