# Perception Encoder
Perception Encoder (PE) is a state-of-the-art encoder for image and video understanding. It is trained via simple vision-language learning and offers excellent performance across a wide range of vision tasks.
## Quick Start
You can get started with the Perception Encoder by following the installation steps from the GitHub repository:
```shell
git clone https://github.com/facebookresearch/perception_models.git
cd perception_models

conda create --name perception_models python=3.12
conda activate perception_models

# Install PyTorch
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 xformers --index-url https://download.pytorch.org/whl/cu124

# We use torchcodec for decoding videos into PyTorch tensors
conda install ffmpeg -c conda-forge
pip install torchcodec==0.1 --index-url=https://download.pytorch.org/whl/cu124

pip install -e .
```
## Features
- State-of-the-art performance: Perception Encoder (PE) delivers outstanding results on a wide range of vision tasks, including zero-shot image and video classification and retrieval.
- General features: PE produces strong, general-purpose features that are useful for downstream tasks, and its large-scale contrastive pretraining can be transferred to downstream tasks through alignment tuning (see the sketch below).
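
To make the second point concrete, below is a minimal sketch of training a linear probe on frozen PE image embeddings. It assumes `pe.CLIP` exposes an `encode_image` method (as in typical CLIP-style implementations) and uses a hypothetical 10-class dataset; treat both as illustrative assumptions rather than the repository's documented API.

```python
import torch
import torch.nn as nn

import core.vision_encoder.pe as pe
import core.vision_encoder.transforms as transforms

# Frozen backbone: only the linear probe below is trained.
model = pe.CLIP.from_config("PE-Core-B16-224", pretrained=True).cuda().eval()
preprocess = transforms.get_image_transform(model.image_size)  # apply to PIL images in your dataset

num_classes = 10                             # assumption: a hypothetical 10-class dataset
probe = nn.Linear(1024, num_classes).cuda()  # 1024 = CLIP dim of the B/16 model (see table below)
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)

def embed(images: torch.Tensor) -> torch.Tensor:
    """Frozen feature extraction; `encode_image` is assumed to exist (CLIP-style API)."""
    with torch.no_grad(), torch.autocast("cuda"):
        return model.encode_image(images).float()

# Training step sketch (the dataloader is assumed to yield preprocessed image batches and labels):
# for images, labels in train_loader:
#     logits = probe(embed(images.cuda()))
#     loss = nn.functional.cross_entropy(logits, labels.cuda())
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
```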
## Documentation
### Model Details
#### Perception Encoder: Core
PE core is the base model. It is trained with a robust image pretraining schedule and finetuned on data generated by a synthetic video data engine.
#### Model Configurations
PE core comes in 3 sizes. The main checkpoint is PE core G, and the L and B models are distilled from it.
| Scale | Tower | Params | Width | Depth | MLP | Heads | CLIP Dim | Resolution / Context Len |
|:-----:|:-----:|:------:|:-----:|:-----:|:---:|:-----:|:--------:|:------------------------:|
| B/16 | Vision | 0.09B | 768  | 12 | 3072 | 12 | 1024 | 224px |
|      | Text   | 0.31B | 1024 | 24 | 4096 | 16 | 1024 | 32 tokens |
| L/14 | Vision | 0.32B | 1024 | 24 | 4096 | 16 | 1024 | 336px |
|      | Text   | 0.31B | 1024 | 24 | 4096 | 16 | 1024 | 32 tokens |
| G/14 | Vision | 1.88B | 1536 | 50 | 8960 | 16 | 1280 | 448px |
|      | Text   | 0.47B | 1280 | 24 | 5120 | 20 | 1280 | 72 tokens |
All PE core models use an attention pooling block with 8 heads on top of the vision tower. The L and B models additionally have a class token for global aggregation. See the paper for more details.
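
For intuition, here is a minimal sketch of such an attention-pooling head in PyTorch: a single learned query cross-attends over the patch tokens produced by the vision tower to form one global embedding. The module name, initialization, and dimensions are illustrative assumptions rather than the repository's implementation; the example sizes follow the G/14 row above (width 1536, 8 pooling heads).

```python
import torch
import torch.nn as nn

class AttnPool(nn.Module):
    """Illustrative attention pooling: one learned query attends over all patch tokens."""

    def __init__(self, embed_dim: int, num_heads: int = 8):
        super().__init__()
        self.query = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_patches, embed_dim) output of the vision tower
        q = self.query.expand(tokens.shape[0], -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)  # cross-attention from the query to the patches
        return self.norm(pooled.squeeze(1))       # (batch, embed_dim) global embedding

# Example: pool 1024 patch tokens of width 1536 with 8 heads
pooled = AttnPool(embed_dim=1536, num_heads=8)(torch.randn(2, 1024, 1536))
print(pooled.shape)  # torch.Size([2, 1536])
```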
#### Model Performance
PE core achieves extremely strong results in zero-shot image and video classification and retrieval.
| Model | Checkpoint | IN-1k | IN-v2 | IN-A | ObjectNet | COCO-T2I | Kinetics-400 | VTT-T2I |
|:-----:|:----------:|:-----:|:-----:|:----:|:---------:|:--------:|:------------:|:-------:|
| B/16 224px | [PE-Core-B16-224](https://huggingface.co/facebook/PE-Core-B16-224) | 78.4 | 71.7 | 62.4 | 71.9 | 50.9 | 65.6 | 47.6 |
| L/14 336px | [PE-Core-L14-336](https://huggingface.co/facebook/PE-Core-L14-336) | 83.5 | 77.9 | 89.0 | 84.7 | 57.1 | 73.4 | 50.3 |
| G/14 448px | [PE-Core-G14-448](https://huggingface.co/facebook/PE-Core-G14-448) | 85.4 | 80.2 | 92.6 | 88.2 | 58.1 | 76.9 | 51.2 |
PE core performs particularly well on difficult benchmarks such as ObjectNet and ImageNet-A.
### How to use
#### Model loading code
The model loading code is provided in https://github.com/facebookresearch/perception_models. The installation steps are shown in the "Quick Start" section.
#### Image and Text Feature Extraction with a Trained Model
```python
import torch
from PIL import Image

import core.vision_encoder.pe as pe
import core.vision_encoder.transforms as transforms

print("CLIP configs:", pe.CLIP.available_configs())

# Load a pretrained PE core checkpoint and move it to the GPU
model = pe.CLIP.from_config("PE-Core-L14-336", pretrained=True)
model = model.cuda()

# Matching image preprocessing and text tokenizer for this checkpoint
preprocess = transforms.get_image_transform(model.image_size)
tokenizer = transforms.get_text_tokenizer(model.context_length)

image = preprocess(Image.open("docs/assets/cat.png")).unsqueeze(0).cuda()
text = tokenizer(["a diagram", "a dog", "a cat"]).cuda()

with torch.no_grad(), torch.autocast("cuda"):
    image_features, text_features, logit_scale = model(image, text)
    text_probs = (logit_scale * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
```
You can find more details in the GitHub repo.
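
For video inputs, a simple zero-shot classification recipe is to embed a few frames and average them before comparing against text prompts. The sketch below assumes the frames have already been decoded (e.g. via torchcodec) into a `(num_frames, 3, H, W)` uint8 tensor and that `pe.CLIP` exposes CLIP-style `encode_image` / `encode_text` methods; the decoding step, the method names, and the fixed logit scale of 100 are all illustrative assumptions.

```python
import torch
from PIL import Image

import core.vision_encoder.pe as pe
import core.vision_encoder.transforms as transforms

model = pe.CLIP.from_config("PE-Core-L14-336", pretrained=True).cuda().eval()
preprocess = transforms.get_image_transform(model.image_size)
tokenizer = transforms.get_text_tokenizer(model.context_length)

# Assumption: `frames` holds decoded video frames as a (num_frames, 3, H, W) uint8 tensor.
frames = torch.randint(0, 256, (8, 3, 480, 640), dtype=torch.uint8)  # placeholder frames
images = torch.stack([
    preprocess(Image.fromarray(f.permute(1, 2, 0).numpy())) for f in frames
]).cuda()
text = tokenizer(["playing guitar", "riding a bike", "cooking"]).cuda()

with torch.no_grad(), torch.autocast("cuda"):
    frame_features = model.encode_image(images)               # (num_frames, dim) per-frame embeddings
    video_feature = frame_features.mean(dim=0, keepdim=True)  # average-pool frames into one embedding
    video_feature = video_feature / video_feature.norm(dim=-1, keepdim=True)
    text_features = model.encode_text(text)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

probs = (100.0 * video_feature @ text_features.T).softmax(dim=-1)
print("Class probs:", probs)
```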
## License
This project is licensed under the Apache-2.0 license.
## Citation
If you find our code useful for your research, please consider citing:
```bibtex
@article{bolya2025PerceptionEncoder,
  title={Perception Encoder: The best visual embeddings are not at the output of the network},
  author={Daniel Bolya and Po-Yao Huang and Peize Sun and Jang Hyun Cho and Andrea Madotto and Chen Wei and Tengyu Ma and Jiale Zhi and Jathushan Rajasegaran and Hanoona Rasheed and Junke Wang and Marco Monteiro and Hu Xu and Shiyu Dong and Nikhila Ravi and Daniel Li and Piotr Doll{\'a}r and Christoph Feichtenhofer},
  journal={arXiv},
  year={2025}
}

@article{cho2025PerceptionLM,
  title={PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding},
  author={Jang Hyun Cho and Andrea Madotto and Effrosyni Mavroudi and Triantafyllos Afouras and Tushar Nagarajan and Muhammad Maaz and Yale Song and Tengyu Ma and Shuming Hu and Hanoona Rasheed and Peize Sun and Po-Yao Huang and Daniel Bolya and Suyog Jain and Miguel Martin and Huiyu Wang and Nikhila Ravi and Shashank Jain and Temmy Stark and Shane Moon and Babak Damavandi and Vivian Lee and Andrew Westbury and Salman Khan and Philipp Kr\"{a}henb\"{u}hl and Piotr Doll{\'a}r and Lorenzo Torresani and Kristen Grauman and Christoph Feichtenhofer},
  journal={arXiv},
  year={2025}
}
```