PE Lang G14 448

Developed by facebook
The Perception Encoder is a state-of-the-art encoder for image and video understanding, trained via vision-language learning, with strong generalization capabilities.
Downloads 247
Release Time : 4/11/2025

Model Overview

The Perception Encoder (PE) is a family of large-scale vision encoder models that perform well across a range of visual tasks. Contrastive pre-training, followed by fine-tuning on synthetically aligned videos, gives the models strong classification and retrieval performance and features that generalize to downstream tasks.
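
As a rough illustration of what contrastive pre-training optimizes, the sketch below computes a symmetric InfoNCE-style loss over a batch of matched image and text embeddings, and then reuses the same cosine similarity for retrieval. It is a minimal sketch, not PE's actual training code: the function names, the temperature value of 0.07, and the toy two-dimensional embeddings are all assumptions chosen for illustration.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss; pair i of images matches pair i of texts.

    The temperature default is an illustrative assumption, not a PE setting.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j]: temperature-scaled cosine similarity of image i and text j.
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Image-to-text direction: row i should put its mass on column i.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: column j should put its mass on row j.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (i2t + t2i) / 2


def retrieve(query_emb, candidate_embs):
    """Return the index of the candidate most similar to the query."""
    q = l2_normalize(query_emb)
    sims = [sum(a * b for a, b in zip(q, l2_normalize(c))) for c in candidate_embs]
    return max(range(len(sims)), key=sims.__getitem__)


if __name__ == "__main__":
    aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
    shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
    print(aligned < shuffled)
    print(retrieve([1.0, 0.1], [[0.0, 1.0], [1.0, 0.0]]))
```

The loss is low when each image embedding sits closest to its own text embedding and high when the pairs are mismatched. The same similarity scores then serve zero-shot classification and retrieval without further training.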

Model Features

Strong Generalization Capability
PE's internal features generalize well and can be adapted to a variety of downstream tasks.
Language Alignment Optimization
The language-aligned variant of PE (PE Lang) is tuned for versatility across multimodal language modeling scenarios.
Outstanding Document Processing Capability
Performs especially well on OCR and document tasks.

Model Capabilities

Image Understanding
Video Understanding
Document Question Answering
Information Question Answering
Text Question Answering
Multimodal Language Modeling

Use Cases

Document Processing: Document Question Answering
Answers questions based on document content.
Achieved 94.6 accuracy on the test set.

Visual Question Answering: Information Question Answering
Answers questions based on image or video content.
Achieved 78.8 accuracy on the test set.

Multimodal Understanding: Perception Testing
Evaluates the model's understanding of visual content.
Achieved 82.7 accuracy on the test set.