PE Lang L14 448

Developed by facebook
The Perception Encoder (PE) is an advanced image and video understanding encoder trained through vision-language learning, achieving state-of-the-art performance on various visual tasks.
Downloads 1,087
Release Time: 4/11/2025

Model Overview

The Perception Encoder (PE) is a family of large-scale vision encoders. Through robust contrastive pre-training followed by fine-tuning on synthetically aligned videos, PE models outperform existing models on classification and retrieval tasks and produce highly generalizable features for downstream tasks.
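
To make "contrastive pre-training" concrete, here is a minimal sketch of a symmetric image-text contrastive loss in plain Python. The page does not state PE's exact training objective, so this assumes a CLIP-style loss. The function names, the toy 2-dimensional embeddings, and the temperature of 0.07 are illustrative assumptions, not values taken from the model.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over an image-text similarity matrix.

    Pair i (image i, text i) is treated as the positive; every other
    combination in the batch is a negative. The temperature is an assumed value.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)

    # logits[i][j]: temperature-scaled cosine similarity of image i and text j
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]

    # Image-to-text direction: row i should put its mass on column i.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Text-to-image direction: column j should put its mass on row j.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n

    return (i2t + t2i) / 2


# Toy batch: matching pairs point the same way, mismatched pairs are orthogonal.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

The loss is close to zero when each image embedding lines up with its own caption and grows when the pairs are shuffled, which is the signal that pulls matching image and text embeddings together during pre-training.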

Model Features

Powerful Visual Understanding
Achieves state-of-the-art performance on various visual tasks through contrastive pre-training and video fine-tuning.
Generalizable Feature Generation
Produces highly generalizable features in its intermediate layers; on downstream tasks these outperform the features conventionally taken from the output layer.
Language Alignment Capability
The language-aligned version of PE is optimized for multimodal language modeling and performs especially well on OCR and document tasks.
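
The point about internal features can be illustrated with a small, framework-free sketch: run the layers in order, keep every activation, and hand a downstream task the output of a chosen intermediate layer instead of the last one. The `run_encoder` helper and the toy layers are hypothetical and stand in for a real vision transformer; this is not the PE API, and the page does not say which layer works best.

```python
def run_encoder(layers, x, return_layer=None):
    """Apply `layers` in order and return one activation.

    return_layer is 1-indexed. None means "use the final output", which is
    the conventional choice; an integer selects an intermediate layer.
    """
    activations = []
    for layer in layers:
        x = layer(x)
        activations.append(x)

    if return_layer is None:
        return activations[-1]
    if not 1 <= return_layer <= len(layers):
        raise ValueError(f"return_layer must be in 1..{len(layers)}")
    return activations[return_layer - 1]


# Hypothetical stand-ins for encoder blocks (a real model would use transformer layers).
toy_layers = [
    lambda v: [x + 1 for x in v],
    lambda v: [x * 2 for x in v],
    lambda v: [x - 3 for x in v],
]

final_features = run_encoder(toy_layers, [1.0, 2.0])
intermediate_features = run_encoder(toy_layers, [1.0, 2.0], return_layer=2)
print(final_features, intermediate_features)
```

In practice the layer to read from would be chosen empirically, for example by comparing downstream validation scores across layers.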

Model Capabilities

Image Feature Extraction
Video Understanding
Multimodal Alignment
Document Understanding
OCR Task Processing

Use Cases

Document Processing
Document Question Answering
Handles document QA tasks such as Doc VQA
Achieves 94.6% accuracy on the Doc VQA test set
Information Extraction
Extracts key information from documents
Achieves 78.8% accuracy on the InfoQA test set
Visual Question Answering
Text-based Visual Question Answering
Answers questions based on text content in images
Achieves 86.5% accuracy on TextVQA
Video Understanding
Video Content Analysis
Understands video content and answers questions
Achieves 77.1% accuracy on MVBench