PE-Core-G14-448: Open-Source Encoder for Image and Video Understanding

PE Core G14 448

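Developed by facebook

Perception Encoder (PE) is a state-of-the-art encoder for image and video understanding, trained via simple vision-language learning and performing strongly across a wide range of vision tasks.

License: Apache-2.0 #zero-shot visual understanding #multimodal contrastive learning #high-accuracy image classification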

Release date: 4/11/2025

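Model Overview

Perception Encoder (PE) is a family of large-scale vision encoder models. It uses a robust contrastive pretraining recipe and is fine-tuned on synthetically aligned videos. PE not only outperforms all existing models on classification and retrieval, it also internally produces strong, general features that suit downstream tasks.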

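Model Features

Strong zero-shot capability

Achieves very strong performance on zero-shot image classification and retrieval as well as zero-shot video classification and retrieval

General-purpose internal features

Internally produces strong, general features that suit a range of downstream tasks

Strong on hard benchmarks

Performs especially well on hard benchmarks such as ObjectNet and ImageNet-A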

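Model Capabilities

Zero-shot image classification

Zero-shot image retrieval

Zero-shot video classification

Zero-shot video retrieval

Visual feature extraction

Text feature extraction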

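Use Cases

Image understanding

Image classification

Classify new images without fine-tuning

Reaches 85.4% accuracy on ImageNet-1k

Image retrieval

Retrieve relevant images from a text query

Scores 58.1 on COCO text-to-image retrieval

Video understanding

Video classification

Classify new videos without fine-tuning

Reaches 76.9% accuracy on Kinetics-400

Video retrieval

Retrieve relevant video clips from a text query

Scores 51.2 on VTT text-to-video retrieval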
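
The retrieval use cases above come down to ranking gallery items by how similar their embeddings are to the query embedding. The following is a minimal sketch of that ranking step, assuming cosine similarity is the comparison measure. The toy 3-dimensional vectors and the helper names `cosine` and `rank_by_text` are made up for illustration and are not part of the PE codebase.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_text(text_emb, image_embs):
    """Return image indices sorted from most to least similar to the text."""
    scores = [cosine(text_emb, img) for img in image_embs]
    return sorted(range(len(image_embs)), key=lambda i: scores[i], reverse=True)

# Toy embeddings standing in for encoder outputs (hypothetical values).
query = [0.9, 0.1, 0.0]
gallery = [
    [0.0, 1.0, 0.0],  # image 0
    [1.0, 0.0, 0.1],  # image 1
    [0.5, 0.5, 0.0],  # image 2
]

ranking = rank_by_text(query, gallery)
print(ranking)  # [1, 2, 0]
```

Video retrieval would follow the same pattern, with one embedding per clip in place of one per image.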

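🚀 Perception Encoder

Perception Encoder (PE) is a state-of-the-art encoder for image and video understanding, trained via simple vision-language learning. It performs strongly across a variety of vision tasks and provides strong, general features for downstream tasks.

🚀 Quick Start

Codebase installation

We provide the pretraining code on GitHub. You can install it with the following steps: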

git clone https://github.com/facebookresearch/perception_models.git
cd perception_models

conda create --name perception_models python=3.12
conda activate perception_models

# Install PyTorch
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 xformers --index-url https://download.pytorch.org/whl/cu124

# We use torchcodec to decode videos into PyTorch tensors
conda install ffmpeg -c conda-forge
pip install torchcodec==0.1 --index-url=https://download.pytorch.org/whl/cu124

pip install -e .

This installs an editable version of the repository, so you can modify the code without reinstalling the package each time.
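
After installation it can help to confirm that the packages named in the install commands can be found by the active environment. The following is a small sketch using only the standard library. The helper name `missing_packages` is hypothetical, and the check only looks for each package on the import path. It does not verify versions or CUDA support.

```python
import importlib.util

# Package names taken from the pip commands above.
REQUIRED = ["torch", "torchvision", "torchaudio", "xformers", "torchcodec"]

def missing_packages(names):
    """Return the subset of `names` that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```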

Image and Text Feature Extraction

The following example uses a trained model to extract image and text features:

import torch
from PIL import Image
import core.vision_encoder.pe as pe
import core.vision_encoder.transforms as transforms

print("CLIP configs:", pe.CLIP.available_configs())
# CLIP configs: ['PE-Core-G14-448', 'PE-Core-L14-336', 'PE-Core-B16-224']

model = pe.CLIP.from_config("PE-Core-G14-448", pretrained=True)  # Downloads from Hugging Face
model = model.cuda()

preprocess = transforms.get_image_transform(model.image_size)
tokenizer = transforms.get_text_tokenizer(model.context_length)

image = preprocess(Image.open("docs/assets/cat.png")).unsqueeze(0).cuda()
text = tokenizer(["a diagram", "a dog", "a cat"]).cuda()

with torch.no_grad(), torch.autocast("cuda"):
    image_features, text_features, logit_scale = model(image, text)
    text_probs = (logit_scale * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)  # prints: [[0.0, 0.0, 1.0]]

You can find more details in the GitHub repository.
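
The last step of the example above turns image-text similarities into label probabilities. The sketch below reproduces that arithmetic in plain Python so it can be followed without a GPU or the model weights. It assumes the features are compared after L2 normalization, and the value `logit_scale=100.0` and the toy vectors are illustrative assumptions rather than values read from the model.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def label_probs(image_feat, text_feats, logit_scale):
    """Softmax over scaled cosine similarities between one image and several texts."""
    img = l2_normalize(image_feat)
    logits = []
    for text_feat in text_feats:
        txt = l2_normalize(text_feat)
        logits.append(logit_scale * sum(a * b for a, b in zip(img, txt)))
    return softmax(logits)

# Toy features: the image vector points mostly along the third text vector.
image_feat = [0.1, 0.2, 0.9]
text_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

probs = label_probs(image_feat, text_feats, logit_scale=100.0)
print([round(p, 3) for p in probs])  # [0.0, 0.0, 1.0]
```

A large scale factor sharpens the distribution, which is why even moderate differences in similarity can produce probabilities that round to 0 and 1.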

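✨ Key Features

State-of-the-art performance: Perception Encoder (PE) is a family of large-scale vision encoder models with state-of-the-art performance on a wide variety of vision tasks.
Strong features: By using a robust contrastive pretraining recipe and fine-tuning on synthetically aligned videos, PE not only outperforms all existing models on classification and retrieval, it also internally produces strong, general features that suit downstream tasks.
Broad applicability: PE achieves strong results on zero-shot image classification and retrieval as well as zero-shot video classification and retrieval.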
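
This page does not describe how a video is turned into a single embedding. One common approach, shown here purely as an assumption, is to sample a fixed number of frames evenly across the clip, encode each frame as an image, and average the per-frame embeddings. The function names and the choice of 8 frames below are hypothetical.

```python
def uniform_frame_indices(num_frames, num_samples):
    """Pick `num_samples` indices spread evenly over a clip (centre of each segment)."""
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    segment = num_frames / num_samples
    return [min(num_frames - 1, int(segment * (i + 0.5))) for i in range(num_samples)]

def mean_pool(frame_embs):
    """Average a list of equal-length frame embeddings into one clip embedding."""
    count = len(frame_embs)
    return [sum(column) / count for column in zip(*frame_embs)]

indices = uniform_frame_indices(num_frames=80, num_samples=8)
print(indices)  # [5, 15, 25, 35, 45, 55, 65, 75]

# Toy per-frame embeddings standing in for image-encoder outputs.
clip_embedding = mean_pool([[1.0, 0.0], [0.0, 1.0]])
print(clip_embedding)  # [0.5, 0.5]
```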

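📚 Detailed Documentation

Model Details

[📃 Tech Report] [📂 Github]

Perception Encoder (PE) was introduced in the paper "Perception Encoder: The best visual embeddings are not at the output of the network".

Model developer: Meta

Model overview: Perception Encoder (PE) is a family of large-scale vision encoder models with state-of-the-art performance on a wide variety of vision tasks. By using a robust contrastive pretraining recipe and fine-tuning on synthetically aligned videos, PE not only outperforms all existing models on classification and retrieval, it also internally produces strong, general features that suit downstream tasks. With alignment tuning, PE lets large-scale contrastive pretraining transfer to downstream tasks and capitalize on those general features.

Perception Encoder: Core

PE core is our base model, trained with a robust image pretraining schedule and fine-tuned on data generated by our synthetic video data engine.

Model Configurations

PE core currently comes in 3 sizes. PE core G is the main checkpoint, and the L and B models are distilled from it.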

Scale	Tower	Params	Width	Depth	MLP	Heads	CLIP Dim	Resolution / Context Length
B/16	Vision	0.09B	768	12	3072	12	1024	224px
	Text	0.31B	1024	24	4096	16	1024	32 tokens
L/14	Vision	0.32B	1024	24	4096	16	1024	336px
	Text	0.31B	1024	24	4096	16	1024	32 tokens
G/14	Vision	1.88B	1536	50	8960	16	1280	448px
	Text	0.47B	1280	24	5120	20	1280	72 tokens

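All PE core models use an attention pooling block with 8 heads on top of the vision tower. The L and B models additionally have a class token for global aggregation. See the paper for more details.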
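
The configuration table can be turned into a rough sense of sequence length and attention head size for each vision tower. The sketch below assumes the "/16" and "/14" suffixes denote the patch size in pixels, following the usual ViT naming convention. That reading is an assumption, not something stated on this page. The counts exclude any class token or pooling query.

```python
# Resolution, width, and head count are copied from the table above.
# Patch size is inferred from the "/16" and "/14" suffixes (assumption).
VISION_TOWERS = {
    "B/16": {"resolution": 224, "patch": 16, "width": 768, "heads": 12},
    "L/14": {"resolution": 336, "patch": 14, "width": 1024, "heads": 16},
    "G/14": {"resolution": 448, "patch": 14, "width": 1536, "heads": 16},
}

def patch_tokens(resolution, patch):
    """Number of non-overlapping square patches in a square image."""
    side = resolution // patch
    return side * side

for name, cfg in VISION_TOWERS.items():
    tokens = patch_tokens(cfg["resolution"], cfg["patch"])
    head_dim = cfg["width"] // cfg["heads"]
    print(f"{name}: {tokens} patch tokens, {head_dim} dims per head")
# B/16: 196 patch tokens, 64 dims per head
# L/14: 576 patch tokens, 64 dims per head
# G/14: 1024 patch tokens, 96 dims per head
```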

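Model Performance

PE core achieves very strong results on zero-shot image classification and retrieval as well as zero-shot video classification and retrieval. Selected results in these areas are shown below: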

Model	Checkpoint	IN-1k	IN-v2	IN-A	ObjectNet	COCO-T2I	Kinetics-400	VTT-T2I
B/16 224px	PE-Core-B16-224	78.4	71.7	62.4	71.9	50.9	65.6	47.6
L/14 336px	PE-Core-L14-336	83.5	77.9	89.0	84.7	57.1	73.4	50.3
G/14 448px	PE-Core-G14-448	85.4	80.2	92.6	88.2	58.1	76.9	51.2

PE core performs particularly well on "hard" benchmarks such as ObjectNet and ImageNet-A.
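
To see where model scale matters most, the table above can be loaded into a small structure and the per-benchmark difference between two checkpoints computed. The sketch below copies the numbers from the table. The `gains` helper is hypothetical and exists only for this illustration.

```python
BENCHMARKS = ["IN-1k", "IN-v2", "IN-A", "ObjectNet", "COCO-T2I", "Kinetics-400", "VTT-T2I"]

# Scores copied from the performance table above.
SCORES = {
    "PE-Core-B16-224": [78.4, 71.7, 62.4, 71.9, 50.9, 65.6, 47.6],
    "PE-Core-L14-336": [83.5, 77.9, 89.0, 84.7, 57.1, 73.4, 50.3],
    "PE-Core-G14-448": [85.4, 80.2, 92.6, 88.2, 58.1, 76.9, 51.2],
}

def gains(small, large):
    """Per-benchmark score difference (large minus small), rounded to one decimal."""
    return {
        bench: round(hi - lo, 1)
        for bench, lo, hi in zip(BENCHMARKS, SCORES[small], SCORES[large])
    }

g = gains("PE-Core-B16-224", "PE-Core-G14-448")
for bench, delta in sorted(g.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{bench}: +{delta}")
# IN-A: +30.2
# ObjectNet: +16.3
# Kinetics-400: +11.3
# IN-v2: +8.5
# COCO-T2I: +7.2
# IN-1k: +7.0
# VTT-T2I: +3.6
```

In this comparison the two largest gaps between the B and G checkpoints fall on ImageNet-A and ObjectNet, the same benchmarks the text singles out as hard.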

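📄 License

This project is released under the Apache-2.0 license.

📖 Citation

If you find our code useful for your research, please consider citing the following papers: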

@article{bolya2025PerceptionEncoder,
  title={Perception Encoder: The best visual embeddings are not at the output of the network},
  author={Daniel Bolya and Po-Yao Huang and Peize Sun and Jang Hyun Cho and Andrea Madotto and Chen Wei and Tengyu Ma and Jiale Zhi and Jathushan Rajasegaran and Hanoona Rasheed and Junke Wang and Marco Monteiro and Hu Xu and Shiyu Dong and Nikhila Ravi and Daniel Li and Piotr Doll{\'a}r and Christoph Feichtenhofer},
  journal={arXiv},
  year={2025}
}

@article{cho2025PerceptionLM,
  title={PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding},
  author={Jang Hyun Cho and Andrea Madotto and Effrosyni Mavroudi and Triantafyllos Afouras and Tushar Nagarajan and Muhammad Maaz and Yale Song and Tengyu Ma and Shuming Hu and Hanoona Rasheed and Peize Sun and Po-Yao Huang and Daniel Bolya and Suyog Jain and Miguel Martin and Huiyu Wang and Nikhila Ravi and Shashank Jain and Temmy Stark and Shane Moon and Babak Damavandi and Vivian Lee and Andrew Westbury and Salman Khan and Philipp Kr\"{a}henb\"{u}hl and Piotr Doll{\'a}r and Lorenzo Torresani and Kristen Grauman and Christoph Feichtenhofer},
  journal={arXiv},
  year={2025}
}