PE-Core-B16-224: Open-Source Image and Video Understanding Encoder

PE Core B16 224

Developed by facebook

The Perception Encoder is a state-of-the-art image and video understanding encoder trained through simple vision-language learning, achieving top performance across various visual tasks.

Text-to-Image Open Source License: Apache-2.0 #Zero-shot Visual Understanding #Multimodal Contrastive Learning #High-resolution Image Processing

Downloads 9,663

Release Time : 4/17/2025

Model Overview

The Perception Encoder is a family of large-scale vision encoder models. Using a robust contrastive pretraining recipe followed by fine-tuning on synthetically aligned videos, the models surpass existing models on classification and retrieval and also produce strong, general-purpose features suitable for downstream tasks.

Model Features

Strong Zero-shot Capabilities

Performs strongly on zero-shot image classification and retrieval, particularly on challenging benchmarks such as ObjectNet and ImageNet-A.

Multi-task Adaptability

Produces general-purpose internal features that suit a range of downstream vision tasks, including image and video understanding.

Multi-scale Models

Offers three scales—B/16, L/14, G/14—to meet different computational resource and performance needs.

Synthetic Data Fine-tuning

Fine-tuned on data generated by synthetic video data engines, enhancing the model's generalization capabilities.

Model Capabilities

Zero-shot image classification

Zero-shot image retrieval

Zero-shot video classification

Zero-shot video retrieval

Visual feature extraction

Text feature extraction

Cross-modal alignment

Use Cases

Image Understanding

Image Classification

Classify images without specific training

Achieves 78.4% zero-shot accuracy on ImageNet-1k with PE-Core-B16-224 (the largest model, G/14, reaches 85.4%)

Image Retrieval

Retrieve relevant images based on text queries

Scores 50.9 on COCO-T2I with PE-Core-B16-224 (the largest model, G/14, reaches 58.1)

Video Understanding

Video Classification

Classify videos without specific training

Achieves 65.6% zero-shot accuracy on Kinetics-400 with PE-Core-B16-224 (the largest model, G/14, reaches 76.9%)

Video Retrieval

Retrieve relevant video clips based on text queries

Scores 47.6 on VTT-T2I with PE-Core-B16-224 (the largest model, G/14, reaches 51.2)
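Text-based retrieval with a contrastive encoder usually comes down to ranking gallery embeddings by their similarity to the query embedding. The following is a minimal sketch of that ranking step in plain Python, not the evaluation code behind the figures above. The toy 2-D vectors stand in for real embeddings, and `rank_gallery` and `recall_at_k` are hypothetical helper names. Recall@k is a common retrieval metric, but this page does not state which metric its retrieval figures use.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_gallery(query_vec, gallery_vecs):
    """Return gallery indices sorted from most to least similar to the query."""
    scores = [cosine(query_vec, g) for g in gallery_vecs]
    return sorted(range(len(gallery_vecs)), key=lambda i: scores[i], reverse=True)


def recall_at_k(query_vecs, gallery_vecs, targets, k):
    """Fraction of queries whose target index appears in the top-k ranked items."""
    hits = sum(
        1
        for q, t in zip(query_vecs, targets)
        if t in rank_gallery(q, gallery_vecs)[:k]
    )
    return hits / len(query_vecs)


# Toy data (assumed for illustration): three "image" embeddings, three "text" queries.
gallery = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
queries = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.2]]
targets = [0, 1, 2]  # index of the correct gallery item for each query

print("ranking for query 0:", rank_gallery(queries[0], gallery))
print("recall@1:", recall_at_k(queries, gallery, targets, k=1))
print("recall@2:", recall_at_k(queries, gallery, targets, k=2))
```

In practice the gallery and query vectors would be the image (or video) and text features produced by the model, and the ranking would be computed as a single matrix product rather than a Python loop.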

🚀 Perception Encoder

Perception Encoder (PE) is a state-of-the-art encoder for image and video understanding. It is trained via simple vision-language learning and offers excellent performance across various vision tasks.

🚀 Quick Start

The Perception Encoder provides a powerful solution for image and video understanding. You can start using it by following the steps below.

✨ Features

State-of-the-art Performance: Achieves outstanding results in zero-shot image and video classification and retrieval.
General Features: Internally produces strong, general features that can be scaled for downstream tasks.
Multiple Model Sizes: Comes in different sizes (B/16, L/14, G/14) to meet various requirements.

📦 Installation

The model loading code is provided in the GitHub repo (https://github.com/facebookresearch/perception_models).

git clone https://github.com/facebookresearch/perception_models.git
cd perception_models
conda create --name perception_models python=3.12
conda activate perception_models
# Install PyTorch
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 xformers --index-url https://download.pytorch.org/whl/cu124
# We use torchcodec for decoding videos into PyTorch tensors
conda install ffmpeg -c conda-forge
pip install torchcodec==0.1 --index-url=https://download.pytorch.org/whl/cu124
pip install -e .

This will install an editable version of the repo, allowing you to make changes to the code without needing to reinstall the package every time.
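After installation, it can help to confirm that the expected packages are visible to the active environment before running the examples. The sketch below uses only the Python standard library. The module list is inferred from the install commands and from the imports used in the usage example on this page (`core` being the package installed by `pip install -e .`), and `check_modules` is a hypothetical helper name, not part of the repo.

```python
import importlib.util

# Top-level module names inferred from the install commands and usage example.
REQUIRED_MODULES = ["torch", "torchvision", "torchaudio", "xformers", "torchcodec", "core"]


def check_modules(names):
    """Return {module_name: bool} indicating whether each module can be located."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_modules(REQUIRED_MODULES).items():
        print(f"{name:12s} {'ok' if found else 'MISSING'}")
```

This only checks that each module can be found, not that the installed versions match the pinned ones or that CUDA is usable.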

💻 Usage Examples

Basic Usage

import torch
from PIL import Image
import core.vision_encoder.pe as pe
import core.vision_encoder.transforms as transforms

print("CLIP configs:", pe.CLIP.available_configs())
# CLIP configs: ['PE-Core-G14-448', 'PE-Core-L14-336', 'PE-Core-B16-224']

model = pe.CLIP.from_config("PE-Core-B16-224", pretrained=True)  # Downloads from HF
model = model.cuda()

preprocess = transforms.get_image_transform(model.image_size)
tokenizer = transforms.get_text_tokenizer(model.context_length)

image = preprocess(Image.open("docs/assets/cat.png")).unsqueeze(0).cuda()
text = tokenizer(["a diagram", "a dog", "a cat"]).cuda()

with torch.no_grad(), torch.autocast("cuda"):
    image_features, text_features, logit_scale = model(image, text)
    text_probs = (logit_scale * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)  # prints: [[0.0, 0.0, 1.0]]
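To make the last step of the example concrete, here is a minimal pure-Python sketch of what the expression `(logit_scale * image_features @ text_features.T).softmax(dim=-1)` computes for a single image: scaled cosine similarities against each text prompt, turned into probabilities with a softmax. The 3-D vectors and the `logit_scale` value of 100.0 are made up for illustration, and the explicit normalization is an assumption about what the model does internally; real features come from the model.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_probs(image_vec, text_vecs, logit_scale):
    """Softmax over scaled cosine similarities between one image and N text prompts."""
    image_vec = l2_normalize(image_vec)
    logits = []
    for text_vec in text_vecs:
        text_vec = l2_normalize(text_vec)
        similarity = sum(a * b for a, b in zip(image_vec, text_vec))
        logits.append(logit_scale * similarity)
    # Subtract the max logit before exponentiating for numerical stability.
    max_logit = max(logits)
    exps = [math.exp(z - max_logit) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy embeddings (assumed values), one per prompt, plus one "image" embedding.
labels = ["a diagram", "a dog", "a cat"]
text_vecs = [[0.9, 0.1, 0.0], [0.2, 0.9, 0.1], [0.1, 0.3, 0.9]]
image_vec = [0.1, 0.2, 0.9]

probs = zero_shot_probs(image_vec, text_vecs, logit_scale=100.0)
best = labels[probs.index(max(probs))]
print("Predicted label:", best)
```

A large logit scale sharpens the softmax, which is why the example output above is close to one-hot.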

You can find more details in the GitHub repo.

📚 Documentation

Model Details

📃 Tech Report 📂 Github

Perception Encoder (PE) is a state-of-the-art encoder for image and video understanding trained via simple vision-language learning. It was introduced in "Perception Encoder: The best visual embeddings are not at the output of the network".

Model Developer: Meta

Model Overview: Perception Encoder (PE) is a family of large-scale vision encoder models with state-of-the-art performance on a large variety of vision tasks. By using a robust contrastive pretraining recipe and finetuning on synthetically aligned videos, PE not only outperforms all existing models on classification and retrieval, but it also internally produces strong, general features that scale for downstream tasks. PE unlocks the ability for large-scale contrastive pretraining to transfer to downstream tasks with alignment tuning to capitalize on those general features.

Model Configurations

PE core currently comes in 3 sizes. PE core G is the main checkpoint, with L and B models distilled from it.

Scale	Tower	Params	Width	Depth	MLP	Heads	CLIP Dim	Resolution / Context Len
B/16	Vision	0.09B	768	12	3072	12	1024	224px
	Text	0.31B	1024	24	4096	16	1024	32 tokens
L/14	Vision	0.32B	1024	24	4096	16	1024	336px
	Text	0.31B	1024	24	4096	16	1024	32 tokens
G/14	Vision	1.88B	1536	50	8960	16	1280	448px
	Text	0.47B	1280	24	5120	20	1280	72 tokens

All PE core models use an attention pooling block with 8 heads on top of the vision tower. The L and B models additionally have a class token for global aggregation. See the paper for more details.
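As a rough sanity check on the vision-tower rows of the table, the parameter counts can be approximated from width, depth, and MLP size alone. The sketch below assumes a standard transformer block (four width×width attention projections plus a two-matrix MLP) and ignores biases, norms, embeddings, the attention pooling block, and the projection to the CLIP dimension, so it should land somewhat below the reported figures. Reading the "/16" and "/14" in the scale names as patch sizes follows the usual ViT naming convention and is an assumption here, as are the helper names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TowerConfig:
    name: str
    width: int
    depth: int
    mlp: int
    resolution: int  # input side length in pixels
    patch: int       # assumed patch size, read from the scale name


def approx_block_params(cfg):
    """Approximate weight count of the transformer blocks only."""
    attention = 4 * cfg.width * cfg.width  # Q, K, V, and output projections
    mlp = 2 * cfg.width * cfg.mlp          # up- and down-projection
    return cfg.depth * (attention + mlp)


def num_patches(cfg):
    """Number of image patches (tokens before any class token) at the given resolution."""
    side = cfg.resolution // cfg.patch
    return side * side


# Width, depth, MLP, and resolution values are copied from the table above.
VISION_TOWERS = [
    TowerConfig("B/16", 768, 12, 3072, 224, 16),
    TowerConfig("L/14", 1024, 24, 4096, 336, 14),
    TowerConfig("G/14", 1536, 50, 8960, 448, 14),
]

for cfg in VISION_TOWERS:
    params_b = approx_block_params(cfg) / 1e9
    print(f"{cfg.name}: ~{params_b:.3f}B block params, {num_patches(cfg)} patches")
```

The estimates (roughly 0.085B, 0.302B, and 1.848B) are consistent with the table's 0.09B, 0.32B, and 1.88B once the omitted components are taken into account.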

Model Performance

PE core obtains extremely strong results across the board on zero-shot image classification and retrieval as well as zero-shot video classification and retrieval. We present a sample of its performance across those domains below.

Model	Checkpoint	IN-1k	IN-v2	IN-A	ObjectNet	COCO-T2I	Kinetics-400	VTT-T2I
B/16 224px	[PE-Core-B16-224](https://huggingface.co/facebook/PE-Core-B16-224)	78.4	71.7	62.4	71.9	50.9	65.6	47.6
L/14 336px	[PE-Core-L14-336](https://huggingface.co/facebook/PE-Core-L14-336)	83.5	77.9	89.0	84.7	57.1	73.4	50.3
G/14 448px	[PE-Core-G14-448](https://huggingface.co/facebook/PE-Core-G14-448)	85.4	80.2	92.6	88.2	58.1	76.9	51.2

PE core performs particularly well on hard benchmarks such as ObjectNet and ImageNet-A.

📄 License

This project is licensed under the Apache 2.0 license.

📖 Citation

If you find our code useful for your research, please consider citing:

@article{bolya2025PerceptionEncoder,
  title={Perception Encoder: The best visual embeddings are not at the output of the network},
  author={Daniel Bolya and Po-Yao Huang and Peize Sun and Jang Hyun Cho and Andrea Madotto and Chen Wei and Tengyu Ma and Jiale Zhi and Jathushan Rajasegaran and Hanoona Rasheed and Junke Wang and Marco Monteiro and Hu Xu and Shiyu Dong and Nikhila Ravi and Daniel Li and Piotr Doll{\'a}r and Christoph Feichtenhofer},
  journal={arXiv},
  year={2025}
}

@article{cho2025PerceptionLM,
  title={PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding},
  author={Jang Hyun Cho and Andrea Madotto and Effrosyni Mavroudi and Triantafyllos Afouras and Tushar Nagarajan and Muhammad Maaz and Yale Song and Tengyu Ma and Shuming Hu and Hanoona Rasheed and Peize Sun and Po-Yao Huang and Daniel Bolya and Suyog Jain and Miguel Martin and Huiyu Wang and Nikhila Ravi and Shashank Jain and Temmy Stark and Shane Moon and Babak Damavandi and Vivian Lee and Andrew Westbury and Salman Khan and Philipp Kr\"{a}henb\"{u}hl and Piotr Doll{\'a}r and Lorenzo Torresani and Kristen Grauman and Christoph Feichtenhofer},
  journal={arXiv},
  year={2025}
}
