PE-Lang-G14-448: Open-Source Perception Encoder for Image and Video Understanding with Strong Generalization

Developed by Meta (hosted under the facebook organization on Hugging Face)

The Perception Encoder is a state-of-the-art encoder for image and video understanding. It is trained with vision-language learning and generalizes well to downstream tasks.

Category: Text-to-Image | License: Apache-2.0 (open source) | Tags: Multimodal Visual Understanding, Language Alignment Optimization, Document OCR Enhancement

Release Time: 4/11/2025

Model Overview

The Perception Encoder (PE) is a family of large-scale vision encoder models. Through contrastive pretraining followed by fine-tuning on synthetically aligned videos, it achieves strong classification and retrieval results and generalizes to downstream tasks.

Model Features

Strong Generalization Capability

The features PE produces internally are strong and general, and they transfer to a variety of downstream tasks.

Language Alignment Optimization

PE Lang, the language-aligned version of PE, is tuned for multimodal language modeling and is meant to work across different language model decoders and evaluation settings.

Outstanding Document Processing Capability

Performs exceptionally well in OCR and document tasks.

Model Capabilities

Image Understanding

Video Understanding

Document Question Answering

Infographic Question Answering (InfoQA)

Text Question Answering

Multimodal Language Modeling

Use Cases

Document Processing

Document Question Answering

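Answers questions based on document content.

With PE Core G as the encoder and 36 + 1 image tiles, PLM-8B scores 94.6 on Doc VQA (test). PE-Lang-G14-448 with a frozen encoder and no tiling scores 84.4 on Doc VQA (val).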

Visual Question Answering

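Infographic Question Answering (InfoQA)

Answers questions based on image or video content.

With PE Core G as the encoder and 36 + 1 image tiles, PLM-8B scores 78.8 on InfoQA (test). PE-Lang-G14-448 with a frozen encoder and no tiling scores 48.3 on InfoQA (val).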

Multimodal Understanding

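PerceptionTest

Evaluates the model's understanding of visual content.

With PE Core G as the encoder and 32 video frames, PLM-8B scores 82.7 on PerceptionTest (test). PE-Lang-G14-448 with a frozen encoder and no tiling scores 56.0 on PerceptionTest (val).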

🚀 Perception Encoder

Perception Encoder (PE) is a state-of-the-art encoder for image and video understanding. It is trained via simple vision-language learning and performs well across a wide range of vision tasks.

🚀 Quick Start

Model loading code is provided in the project's GitHub repository, which also contains further details.
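The GitHub repository remains the reference for constructing the model itself. As a minimal sketch of the step before that, the checkpoint files could be fetched from the Hugging Face repository named in this card using `snapshot_download` from the `huggingface_hub` package. The helper name `download_checkpoint` and the default `local_dir` are hypothetical, and the sketch assumes `huggingface_hub` is installed and the repository is reachable.

```python
# Repository id taken from the checkpoint link in this card.
REPO_ID = "facebook/PE-Lang-G14-448"


def download_checkpoint(local_dir: str = "./pe_lang_g14_448") -> str:
    """Download all files of the model repository and return the local folder.

    `download_checkpoint` and the default `local_dir` are illustrative names,
    not part of the PE codebase.
    """
    # Imported lazily so that defining this helper does not require the package.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)


# Example (downloads several files, so it is left commented out):
# path = download_checkpoint()
# print("Checkpoint files stored in", path)
```

This only retrieves the files; building the encoder from them still follows the loading code in the GitHub repository.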

✨ Features

State-of-the-art Performance: Perception Encoder (PE) is a family of large-scale vision encoder models. It outperforms all existing models on classification and retrieval tasks, and internally produces strong, general features for downstream tasks.
Versatile for Language Modeling: PE Lang builds on the strong language performance of PE Core's intermediate layers and aligns it further for language modeling. It is versatile for any multimodal language modeling use case, including different language model decoders (e.g., Llama / Qwen) and different eval settings (e.g., native res / tiling). It performs well on OCR and document tasks.

📚 Documentation

Model Details

[📃 Tech Report] [📂 Github]

Perception Encoder (PE) was introduced in "Perception Encoder: The best visual embeddings are not at the output of the network".

Model Developer: Meta

Model Overview: Perception Encoder (PE) combines a robust contrastive pretraining recipe with fine-tuning on synthetically aligned videos. With alignment tuning, it allows large-scale contrastive pretraining to transfer to downstream tasks.

Perception Encoder: Language

PE Lang further aligns for language modeling, following PLM. We release two PE Lang checkpoints, L14-448 and G14-448.
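The checkpoint names can be read as a compact description of the encoder input. The sketch below assumes the naming pattern `PE-<variant>-<size><patch>-<resolution>`, so that `G14-448` means model size G, 14-pixel patches, and 448-pixel square inputs (consistent with the "G/14 448px" label used in the results). The class and function names are hypothetical, and the patch count ignores any extra class token.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckpointSpec:
    variant: str      # e.g. "Lang" or "Core"
    size: str         # e.g. "L" or "G"
    patch_size: int   # patch side length in pixels
    resolution: int   # input side length in pixels

    @property
    def grid_side(self) -> int:
        # Number of patches along one side of the square input.
        return self.resolution // self.patch_size

    @property
    def num_patches(self) -> int:
        return self.grid_side ** 2


def parse_checkpoint_name(name: str) -> CheckpointSpec:
    """Parse names such as 'PE-Lang-G14-448' (assumed naming pattern)."""
    match = re.fullmatch(r"PE-(\w+)-([A-Za-z]+)(\d+)-(\d+)", name)
    if match is None:
        raise ValueError(f"Unrecognized checkpoint name: {name!r}")
    variant, size, patch, resolution = match.groups()
    spec = CheckpointSpec(variant, size, int(patch), int(resolution))
    if spec.resolution % spec.patch_size != 0:
        raise ValueError("Resolution is not a multiple of the patch size")
    return spec


spec = parse_checkpoint_name("PE-Lang-G14-448")
print(spec.grid_side, spec.num_patches)  # 32 1024
```

Under these assumptions both released checkpoints see a 32 x 32 grid of patches per 448px image; they differ in model size, not in input layout.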

Here are their results in our benchmark setting, with a frozen encoder and a 2.6M SFT datamix, using 448px input only (i.e., with no tiling) and Llama 3.1 8B as the decoder:

| Encoder | Checkpoint | Doc VQA (val) | InfoQA (val) | TextVQA | MVBench | PerceptionTest (val) | EgoSchema (val) |
|---|---|---|---|---|---|---|---|
| L/14 448px | [PE-Lang-L14-448](https://huggingface.co/facebook/PE-Lang-L14-448) | 81.9 | 46.4 | 73.0 | 52.3 | 54.7 | 59.8 |
| G/14 448px | [PE-Lang-G14-448](https://huggingface.co/facebook/PE-Lang-G14-448) | 84.4 | 48.3 | 75.2 | 52.4 | 56.0 | 62.0 |
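To make the gap between the two checkpoints easier to read, the short sketch below computes the per-benchmark difference (G/14 minus L/14) from the numbers in the table above. The variable names are illustrative only, and the unweighted mean across benchmarks is a convenience summary, not a metric reported by the authors.

```python
BENCHMARKS = [
    "Doc VQA (val)", "InfoQA (val)", "TextVQA",
    "MVBench", "PerceptionTest (val)", "EgoSchema (val)",
]
# Scores copied from the table above.
L14_SCORES = [81.9, 46.4, 73.0, 52.3, 54.7, 59.8]
G14_SCORES = [84.4, 48.3, 75.2, 52.4, 56.0, 62.0]

# Difference per benchmark, rounded to the table's one-decimal precision.
deltas = {
    name: round(g - l, 1)
    for name, l, g in zip(BENCHMARKS, L14_SCORES, G14_SCORES)
}
mean_delta = round(sum(deltas.values()) / len(deltas), 1)

for name, delta in deltas.items():
    print(f"{name:22s} {delta:+.1f}")
print(f"{'Mean':22s} {mean_delta:+.1f}")
```

In this setting the G/14 checkpoint is ahead on every benchmark, with the largest margin on Doc VQA (+2.5) and the smallest on MVBench (+0.1).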

Here is a sample of the performance obtainable by using PE Core G aligned further with [PLM-8B](https://huggingface.co/facebook/Perception-LM-8B) (stage 3), using 36 + 1 image tiles / 32 video frames with Llama 3.1 8B as the decoder:

| Model | Encoder | Doc VQA (test) | InfoQA (test) | TextVQA | MVBench | PerceptionTest (test) | EgoSchema (test) |
|---|---|---|---|---|---|---|---|
| PLM-8B | [PE-Core-G14-448](https://huggingface.co/facebook/PE-Core-G14-448)* | 94.6 | 78.8 | 86.5 | 77.1 | 82.7 | 68.8 |

* The PE-Core-G14-448 checkpoint was further trained using tiling. We will release the tiling-aligned checkpoint soon.
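For a rough sense of how much more visual input the tiled setting processes, the sketch below counts raw encoder patches. It assumes each tile and each video frame is a 448px square split into 14px patches, and it reads "36 + 1" as 36 tiles plus one extra view of the whole image. It ignores any class token and any token pooling or projection applied before the decoder, so these are encoder-side patch counts, not the number of tokens the language model receives.

```python
PATCH_SIZE = 14        # assumed from the "/14" in the encoder name
VIEW_RESOLUTION = 448  # assumed side length of every tile or frame


def patches_per_view(resolution: int = VIEW_RESOLUTION,
                     patch_size: int = PATCH_SIZE) -> int:
    """Number of non-overlapping patches in one square view."""
    if resolution % patch_size != 0:
        raise ValueError("Resolution must be a multiple of the patch size")
    side = resolution // patch_size
    return side * side


def encoder_patch_budget(num_views: int) -> int:
    """Total raw patches across all views passed through the encoder."""
    return num_views * patches_per_view()


single_image = encoder_patch_budget(1)       # 448px only, no tiling
tiled_image = encoder_patch_budget(36 + 1)   # 36 tiles + 1 whole-image view
video = encoder_patch_budget(32)             # 32 frames

print(single_image, tiled_image, video)  # 1024 37888 32768
```

Under these assumptions the tiled image setting feeds the encoder 37 times as many patches as the single 448px setting, which is one reason the two result tables are not directly comparable.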

See the paper for full performance evaluations and fair comparisons to other models.

📄 License

This project is licensed under the Apache-2.0 license.

📚 Citation

If you find our code useful for your research, please consider citing:

@article{bolya2025PerceptionEncoder,
  title={Perception Encoder: The best visual embeddings are not at the output of the network},
  author={Daniel Bolya and Po-Yao Huang and Peize Sun and Jang Hyun Cho and Andrea Madotto and Chen Wei and Tengyu Ma and Jiale Zhi and Jathushan Rajasegaran and Hanoona Rasheed and Junke Wang and Marco Monteiro and Hu Xu and Shiyu Dong and Nikhila Ravi and Daniel Li and Piotr Doll{\'a}r and Christoph Feichtenhofer},
  journal={arXiv},
  year={2025}
}

@article{cho2025PerceptionLM,
  title={PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding},
  author={Jang Hyun Cho and Andrea Madotto and Effrosyni Mavroudi and Triantafyllos Afouras and Tushar Nagarajan and Muhammad Maaz and Yale Song and Tengyu Ma and Shuming Hu and Hanoona Rasheed and Peize Sun and Po-Yao Huang and Daniel Bolya and Suyog Jain and Miguel Martin and Huiyu Wang and Nikhila Ravi and Shashank Jain and Temmy Stark and Shane Moon and Babak Damavandi and Vivian Lee and Andrew Westbury and Salman Khan and Philipp Kr\"{a}henb\"{u}hl and Piotr Doll{\'a}r and Lorenzo Torresani and Kristen Grauman and Christoph Feichtenhofer},
  journal={arXiv},
  year={2025}
}
