PE-Spatial-G14-448 - Open-Source Perception Encoder - Empowering Image and Video Understanding Applications

PE Spatial G14 448

Developed by facebook

The Perception Encoder (PE) is a state-of-the-art image and video understanding encoder trained through simple vision-language learning.

Open Source License: Apache-2.0

Tags: Multi-task Visual Understanding, Intermediate Feature Extraction, Dense Prediction Optimization

Release Time: 4/11/2025

Model Overview

The Perception Encoder (PE) is a series of large-scale vision encoder models that achieve state-of-the-art performance across various vision tasks. By employing a robust contrastive pre-training scheme and fine-tuning on synthetically aligned videos, PE not only surpasses all existing models in classification and retrieval tasks but also generates powerful, generalizable features internally that can be extended for downstream tasks.

Model Features

Intermediate Feature Extraction

Extracts powerful features from intermediate layers of the model rather than the output layer, providing superior visual embeddings.

SAM Optimization

Refined with a mask-based learning strategy built on SAM 2.1 to improve performance on dense prediction tasks.

Fine Semantic Correspondence

The feature space exhibits fine semantic correspondences, enabling the identification of relationships between object parts.

Model Capabilities

Image feature extraction

Dense prediction task processing

Semantic correspondence analysis

Visual understanding

Use Cases

Computer Vision

Image Classification

Used for image classification tasks

Achieves state-of-the-art performance across various vision tasks

Object Detection

Used for dense prediction tasks such as object detection and segmentation

Reported results: 49.3 on the ADE20k linear probe, 54.2 / 49.3 box / mask mAP on LVIS, and 65.5 box mAP on COCO

🚀 Perception Encoder

Perception Encoder (PE) is a state-of-the-art encoder for image and video understanding, offering excellent performance in various vision tasks through simple vision-language learning.

🚀 Quick Start

The model loading code and further details are available in the GitHub repository.

✨ Features

Overall Model Features

Perception Encoder (PE) is a family of large-scale vision encoder models with state-of-the-art performance on a large variety of vision tasks. By using a robust contrastive pretraining recipe and finetuning on synthetically aligned videos, PE not only outperforms all existing models on classification and retrieval, but it also internally produces strong, general features that scale for downstream tasks. PE unlocks the ability for large-scale contrastive pretraining to transfer to downstream tasks with alignment tuning to capitalize on those general features.

Perception Encoder: Spatial

PE spatial takes the strong spatial performance from the intermediate layers of PE core and aligns it to the end of the network using a simple frozen-teacher self-distillation loss, then refines it further with a novel SAM 2.1 mask-based learning strategy. PE spatial performs well on dense prediction tasks such as detection.
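To make the idea of a frozen-teacher self-distillation loss concrete, here is a minimal NumPy sketch. It assumes one plausible form of the loss, the mean of (1 - cosine similarity) between the student's last-layer tokens and the teacher's intermediate-layer tokens; the exact loss used for PE spatial is described in the Tech Report and may differ. The function name `cosine_distill_loss`, the random stand-in features, and the 64-dimensional feature size are all hypothetical. The 32 x 32 token grid assumes a patch size of 14 at 448px input.

```python
import numpy as np

def cosine_distill_loss(student_tokens, teacher_tokens, eps=1e-8):
    """Mean (1 - cosine similarity) between matching tokens.

    student_tokens, teacher_tokens: arrays of shape (num_tokens, dim).
    The teacher is frozen, so in a real training loop no gradient
    would flow into teacher_tokens.
    """
    s = student_tokens / (np.linalg.norm(student_tokens, axis=-1, keepdims=True) + eps)
    t = teacher_tokens / (np.linalg.norm(teacher_tokens, axis=-1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))

rng = np.random.default_rng(0)
# Stand-in for the frozen teacher's intermediate-layer tokens (32 x 32 patches).
teacher = rng.normal(size=(32 * 32, 64))
# Stand-in for the student's last-layer tokens: the teacher plus small noise.
student = teacher + 0.1 * rng.normal(size=teacher.shape)

loss = cosine_distill_loss(student, teacher)
print(f"distillation loss: {loss:.4f}")
```

The loss is zero when the student's final features point in the same direction as the teacher's intermediate features, and grows toward 2 as they become opposed.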

Although this is only a short finetuning step that uses PE core's intermediate layers as a teacher (a pure CLIP model with a global loss), plus a little refinement with SAM, the resulting feature space is quite detailed and well-aligned. Here we picture the PCA of the last-layer features mapped to LCh color space (see the Tech Report for more details).
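The PCA visualization described above can be sketched roughly as follows. This is an illustrative reconstruction, not the procedure from the Tech Report: it assumes the first principal component is read as lightness and the next two as Lab-style a/b axes, from which chroma and hue follow by the usual polar conversion. The function name `pca_to_lch`, the random stand-in features, and the 64-dimensional feature size are hypothetical.

```python
import numpy as np

def pca_to_lch(features, grid_hw):
    """Project patch features onto their top 3 principal components
    and read them as (lightness, chroma, hue).

    features: (num_patches, dim) array; grid_hw: (rows, cols) of the patch grid.
    Returns an array of shape (rows, cols, 3) with L in [0, 100],
    C >= 0, and h in degrees.
    """
    centered = features - features.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, sorted by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    pcs = centered @ vt[:3].T                      # (num_patches, 3)
    pcs = pcs / (np.abs(pcs).max(axis=0) + 1e-8)   # scale each component to [-1, 1]

    lightness = 50.0 * (pcs[:, 0] + 1.0)           # first component -> L
    a, b = pcs[:, 1], pcs[:, 2]                    # next two -> Lab-style a/b axes
    chroma = np.hypot(a, b)                        # C = sqrt(a^2 + b^2)
    hue = np.degrees(np.arctan2(b, a)) % 360.0     # h = atan2(b, a)
    return np.stack([lightness, chroma, hue], axis=-1).reshape(*grid_hw, 3)

rng = np.random.default_rng(0)
# Stand-in for last-layer patch features of one 448px image (32 x 32 patches).
features = rng.normal(size=(32 * 32, 64))
lch = pca_to_lch(features, (32, 32))
print(lch.shape)
```

Displaying the result still requires converting LCh to RGB with a color library, which is left out here.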

PE spatial also has nuanced semantic correspondences between objects thanks to its CLIP pretraining. It can show correspondence between parts, such as the heads, backs, and legs of the cats in the first image. It can also show subtler correspondences, as in the last two images, where the red/blue directions still denote parts but the lightness/darkness directions now indicate semantics (i.e., dog/cat breed).
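One common way to probe part correspondences like these is nearest-neighbor matching of patch features between two images under cosine similarity. The sketch below shows that general technique on synthetic data; it is not taken from the PE codebase, and the function name `nearest_patch_matches` and the stand-in features are hypothetical. The second "image" is built as a shuffled, slightly noisy copy of the first so that the correct matches are known.

```python
import numpy as np

def nearest_patch_matches(feats_a, feats_b, eps=1e-8):
    """For each patch in image A, find the most similar patch in image B.

    feats_a: (num_a, dim), feats_b: (num_b, dim).
    Returns (indices into B, cosine similarity of each match).
    """
    a = feats_a / (np.linalg.norm(feats_a, axis=-1, keepdims=True) + eps)
    b = feats_b / (np.linalg.norm(feats_b, axis=-1, keepdims=True) + eps)
    sim = a @ b.T                                  # (num_a, num_b) cosine similarities
    idx = sim.argmax(axis=1)
    return idx, sim[np.arange(len(idx)), idx]

rng = np.random.default_rng(0)
# Stand-in patch features for image A (32 x 32 patches).
feats_a = rng.normal(size=(32 * 32, 64))
# Image B: the same patches in shuffled order, with small noise added.
perm = rng.permutation(len(feats_a))
feats_b = feats_a[perm] + 0.05 * rng.normal(size=feats_a.shape)

idx, score = nearest_patch_matches(feats_a, feats_b)
# perm[idx[i]] is the original patch that A's patch i was matched to.
print("fraction matched correctly:", np.mean(perm[idx] == np.arange(len(feats_a))))
```

With real encoder features, `feats_a` and `feats_b` would come from two different images, and matched patches could be colored identically to visualize which parts correspond.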

We release one checkpoint for PE spatial so far:

Encoder	Checkpoint	ADE20k Linear Probe 448px w/o TTA	LVIS Mask R-CNN 1024px Box / Mask mAP	COCO DETA 1728px Box mAP
G/14 448px	PE-Spatial-G14-448	49.3	54.2 / 49.3	65.5

See the Tech Report for the full set of evaluations and a fair comparison to other works.

📚 Documentation

Model Details

[📃 Tech Report]
[📂 GitHub]

The Perception Encoder was introduced in "Perception Encoder: The best visual embeddings are not at the output of the network".

Model Developer: Meta

📄 License

This project is licensed under the Apache-2.0 license.

📖 Citation

If you find our code useful for your research, please consider citing:

@article{bolya2025PerceptionEncoder,
  title={Perception Encoder: The best visual embeddings are not at the output of the network},
  author={Daniel Bolya and Po-Yao Huang and Peize Sun and Jang Hyun Cho and Andrea Madotto and Chen Wei and Tengyu Ma and Jiale Zhi and Jathushan Rajasegaran and Hanoona Rasheed and Junke Wang and Marco Monteiro and Hu Xu and Shiyu Dong and Nikhila Ravi and Daniel Li and Piotr Doll{\'a}r and Christoph Feichtenhofer},
  journal={arXiv},
  year={2025}
}

@article{cho2025PerceptionLM,
  title={PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding},
  author={Jang Hyun Cho and Andrea Madotto and Effrosyni Mavroudi and Triantafyllos Afouras and Tushar Nagarajan and Muhammad Maaz and Yale Song and Tengyu Ma and Shuming Hu and Hanoona Rasheed and Peize Sun and Po-Yao Huang and Daniel Bolya and Suyog Jain and Miguel Martin and Huiyu Wang and Nikhila Ravi and Shashank Jain and Tammy Stark and Shane Moon and Babak Damavandi and Vivian Lee and Andrew Westbury and Salman Khan and Philipp Kr\"{a}henb\"{u}hl and Piotr Doll{\'a}r and Lorenzo Torresani and Kristen Grauman and Christoph Feichtenhofer},
  journal={arXiv},
  year={2025}
}
