# Perception Encoder
Perception Encoder (PE) is a state-of-the-art encoder for image and video understanding. It is trained via simple vision-language learning and offers excellent performance across a wide range of vision tasks.
## Quick Start
You can get started with the Perception Encoder by following the installation steps from the GitHub repository:
```shell
git clone https://github.com/facebookresearch/perception_models.git
cd perception_models

conda create --name perception_models python=3.12
conda activate perception_models

# Install PyTorch
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 xformers --index-url https://download.pytorch.org/whl/cu124

# We use torchcodec for decoding videos into PyTorch tensors
conda install ffmpeg -c conda-forge
pip install torchcodec==0.1 --index-url=https://download.pytorch.org/whl/cu124

pip install -e .
```
## Features
- State-of-the-art performance: Perception Encoder (PE) delivers outstanding results on a wide range of vision tasks, including zero-shot image and video classification and retrieval.
- General features: PE produces strong, general-purpose features that are useful for downstream tasks, and its large-scale contrastive pretraining can be transferred to downstream tasks through alignment tuning (see the sketch below).
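
To make the second point concrete, below is a minimal sketch of training a linear probe on frozen PE image embeddings. It assumes `pe.CLIP` exposes an `encode_image` method (as in typical CLIP-style implementations) and uses a hypothetical 10-class dataset; treat both as illustrative assumptions rather than the repository's documented API.

```python
import torch
import torch.nn as nn

import core.vision_encoder.pe as pe
import core.vision_encoder.transforms as transforms

# Frozen backbone: only the linear probe below is trained.
model = pe.CLIP.from_config("PE-Core-B16-224", pretrained=True).cuda().eval()
preprocess = transforms.get_image_transform(model.image_size)  # apply to PIL images in your dataset

num_classes = 10                             # assumption: a hypothetical 10-class dataset
probe = nn.Linear(1024, num_classes).cuda()  # 1024 = CLIP dim of the B/16 model (see table below)
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)

def embed(images: torch.Tensor) -> torch.Tensor:
    """Frozen feature extraction; `encode_image` is assumed to exist (CLIP-style API)."""
    with torch.no_grad(), torch.autocast("cuda"):
        return model.encode_image(images).float()

# Training step sketch (the dataloader is assumed to yield preprocessed image batches and labels):
# for images, labels in train_loader:
#     logits = probe(embed(images.cuda()))
#     loss = nn.functional.cross_entropy(logits, labels.cuda())
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
```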
## Documentation
### Model Details
#### Perception Encoder: Core
PE core is the base model. It is trained with a robust image pretraining schedule and finetuned on data generated by a synthetic video data engine.
#### Model Configurations
PE core comes in 3 sizes. The main checkpoint is PE core G, and the L and B models are distilled from it.
| Scale | Tower | Params | Width | Depth | MLP | Heads | CLIP Dim | Resolution / Context Len |
|:-----:|:-----:|:------:|:-----:|:-----:|:---:|:-----:|:--------:|:------------------------:|
| B/16 | Vision | 0.09B | 768  | 12 | 3072 | 12 | 1024 | 224px |
|      | Text   | 0.31B | 1024 | 24 | 4096 | 16 | 1024 | 32 tokens |
| L/14 | Vision | 0.32B | 1024 | 24 | 4096 | 16 | 1024 | 336px |
|      | Text   | 0.31B | 1024 | 24 | 4096 | 16 | 1024 | 32 tokens |
| G/14 | Vision | 1.88B | 1536 | 50 | 8960 | 16 | 1280 | 448px |
|      | Text   | 0.47B | 1280 | 24 | 5120 | 20 | 1280 | 72 tokens |
All PE core models use an attention pooling block with 8 heads on top of the vision tower. The L and B models additionally have a class token for global aggregation. See the paper for more details.
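
For intuition, here is a minimal sketch of such an attention-pooling head in PyTorch: a single learned query cross-attends over the patch tokens produced by the vision tower to form one global embedding. The module name, initialization, and dimensions are illustrative assumptions rather than the repository's implementation; the example sizes follow the G/14 row above (width 1536, 8 pooling heads).

```python
import torch
import torch.nn as nn

class AttnPool(nn.Module):
    """Illustrative attention pooling: one learned query attends over all patch tokens."""

    def __init__(self, embed_dim: int, num_heads: int = 8):
        super().__init__()
        self.query = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_patches, embed_dim) output of the vision tower
        q = self.query.expand(tokens.shape[0], -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)  # cross-attention from the query to the patches
        return self.norm(pooled.squeeze(1))       # (batch, embed_dim) global embedding

# Example: pool 1024 patch tokens of width 1536 with 8 heads
pooled = AttnPool(embed_dim=1536, num_heads=8)(torch.randn(2, 1024, 1536))
print(pooled.shape)  # torch.Size([2, 1536])
```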
#### Model Performance
PE core achieves extremely strong results in zero-shot image and video classification and retrieval.
| Model | Checkpoint | IN-1k | IN-v2 | IN-A | ObjectNet | COCO-T2I | Kinetics-400 | VTT-T2I |
|:-----:|:----------:|:-----:|:-----:|:----:|:---------:|:--------:|:------------:|:-------:|
| B/16 224px | [PE-Core-B16-224](https://huggingface.co/facebook/PE-Core-B16-224) | 78.4 | 71.7 | 62.4 | 71.9 | 50.9 | 65.6 | 47.6 |
| L/14 336px | [PE-Core-L14-336](https://huggingface.co/facebook/PE-Core-L14-336) | 83.5 | 77.9 | 89.0 | 84.7 | 57.1 | 73.4 | 50.3 |
| G/14 448px | [PE-Core-G14-448](https://huggingface.co/facebook/PE-Core-G14-448) | 85.4 | 80.2 | 92.6 | 88.2 | 58.1 | 76.9 | 51.2 |
PE core performs particularly well on difficult benchmarks such as ObjectNet and ImageNet-A.
### How to use
#### Model loading code
The model loading code is provided in https://github.com/facebookresearch/perception_models. The installation steps are shown in the "Quick Start" section.
#### Image and Text Feature Extraction with a Trained Model
```python
import torch
from PIL import Image

import core.vision_encoder.pe as pe
import core.vision_encoder.transforms as transforms

print("CLIP configs:", pe.CLIP.available_configs())

# Load a pretrained PE core checkpoint and move it to the GPU
model = pe.CLIP.from_config("PE-Core-L14-336", pretrained=True)
model = model.cuda()

# Matching image preprocessing and text tokenizer for this checkpoint
preprocess = transforms.get_image_transform(model.image_size)
tokenizer = transforms.get_text_tokenizer(model.context_length)

image = preprocess(Image.open("docs/assets/cat.png")).unsqueeze(0).cuda()
text = tokenizer(["a diagram", "a dog", "a cat"]).cuda()

with torch.no_grad(), torch.autocast("cuda"):
    image_features, text_features, logit_scale = model(image, text)
    text_probs = (logit_scale * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
```
You can find more details in the GitHub repo.
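
For video inputs, a simple zero-shot classification recipe is to embed a few frames and average them before comparing against text prompts. The sketch below assumes the frames have already been decoded (e.g. via torchcodec) into a `(num_frames, 3, H, W)` uint8 tensor and that `pe.CLIP` exposes CLIP-style `encode_image` / `encode_text` methods; the decoding step, the method names, and the fixed logit scale of 100 are all illustrative assumptions.

```python
import torch
from PIL import Image

import core.vision_encoder.pe as pe
import core.vision_encoder.transforms as transforms

model = pe.CLIP.from_config("PE-Core-L14-336", pretrained=True).cuda().eval()
preprocess = transforms.get_image_transform(model.image_size)
tokenizer = transforms.get_text_tokenizer(model.context_length)

# Assumption: `frames` holds decoded video frames as a (num_frames, 3, H, W) uint8 tensor.
frames = torch.randint(0, 256, (8, 3, 480, 640), dtype=torch.uint8)  # placeholder frames
images = torch.stack([
    preprocess(Image.fromarray(f.permute(1, 2, 0).numpy())) for f in frames
]).cuda()
text = tokenizer(["playing guitar", "riding a bike", "cooking"]).cuda()

with torch.no_grad(), torch.autocast("cuda"):
    frame_features = model.encode_image(images)               # (num_frames, dim) per-frame embeddings
    video_feature = frame_features.mean(dim=0, keepdim=True)  # average-pool frames into one embedding
    video_feature = video_feature / video_feature.norm(dim=-1, keepdim=True)
    text_features = model.encode_text(text)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

probs = (100.0 * video_feature @ text_features.T).softmax(dim=-1)
print("Class probs:", probs)
```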
## License
This project is licensed under the Apache-2.0 license.
## Citation
If you find our code useful for your research, please consider citing:
```bibtex
@article{bolya2025PerceptionEncoder,
  title={Perception Encoder: The best visual embeddings are not at the output of the network},
  author={Daniel Bolya and Po-Yao Huang and Peize Sun and Jang Hyun Cho and Andrea Madotto and Chen Wei and Tengyu Ma and Jiale Zhi and Jathushan Rajasegaran and Hanoona Rasheed and Junke Wang and Marco Monteiro and Hu Xu and Shiyu Dong and Nikhila Ravi and Daniel Li and Piotr Doll{\'a}r and Christoph Feichtenhofer},
  journal={arXiv},
  year={2025}
}

@article{cho2025PerceptionLM,
  title={PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding},
  author={Jang Hyun Cho and Andrea Madotto and Effrosyni Mavroudi and Triantafyllos Afouras and Tushar Nagarajan and Muhammad Maaz and Yale Song and Tengyu Ma and Shuming Hu and Hanoona Rasheed and Peize Sun and Po-Yao Huang and Daniel Bolya and Suyog Jain and Miguel Martin and Huiyu Wang and Nikhila Ravi and Shashank Jain and Temmy Stark and Shane Moon and Babak Damavandi and Vivian Lee and Andrew Westbury and Salman Khan and Philipp Kr\"{a}henb\"{u}hl and Piotr Doll{\'a}r and Lorenzo Torresani and Kristen Grauman and Christoph Feichtenhofer},
  journal={arXiv},
  year={2025}
}
```