PE Core B16 224

Developed by facebook
The Perception Encoder is a state-of-the-art image and video understanding encoder trained through simple vision-language learning, achieving top performance across various visual tasks.
Downloads 9,663
Release Time: 4/17/2025

Model Overview

The Perception Encoder is a family of large-scale vision encoders. The models surpass existing models on classification and retrieval tasks and also produce strong, general-purpose features for downstream tasks. This comes from robust contrastive pre-training followed by fine-tuning on synthetically aligned videos.

Model Features

Strong Zero-shot Capabilities
Performs strongly on zero-shot image classification and retrieval, and stands out on challenging benchmarks such as ObjectNet and ImageNet-A.
Multi-task Adaptability
Produces general-purpose internal features that suit a range of downstream vision tasks, including image and video understanding.
Multi-scale Models
Offers three scales (B/16, L/14, G/14) to meet different computational resource and performance needs.
Synthetic Data Fine-tuning
Fine-tuned on data generated by synthetic video data engines, enhancing the model's generalization capabilities.

Model Capabilities

Zero-shot image classification
Zero-shot image retrieval
Zero-shot video classification
Zero-shot video retrieval
Visual feature extraction
Text feature extraction
Cross-modal alignment
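
Zero-shot classification with a contrastively trained encoder usually comes down to comparing one image embedding against one text embedding per candidate label. The sketch below illustrates that comparison only. It does not load the model: the three-dimensional vectors are made-up placeholders standing in for encoder outputs, and the function names and the `logit_scale` value of 100 are assumptions for illustration, not values taken from this model.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Return a probability per label from image/text embedding similarity.

    logit_scale is a hypothetical temperature; a real model ships its own value.
    """
    img = l2_normalize(image_emb)
    logits = []
    for text_emb in text_embs:
        txt = l2_normalize(text_emb)
        logits.append(logit_scale * sum(a * b for a, b in zip(img, txt)))
    # Numerically stable softmax over the label logits.
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Placeholder embeddings (assumed values, not real encoder outputs).
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text_embs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
image_emb = [0.8, 0.3, 0.1]

probs = zero_shot_probs(image_emb, text_embs)
best = labels[probs.index(max(probs))]
print(best)
```

In practice the placeholder vectors would be replaced by the encoder's image and text features, and the label prompts would be chosen to match the target dataset.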

Use Cases

Image Understanding
Image Classification
Classify images without specific training
Achieves 85.4% accuracy on ImageNet-1k
Image Retrieval
Retrieve relevant images based on text queries
Achieves 58.1% accuracy on COCO-T2I
Video Understanding
Video Classification
Classify videos without specific training
Achieves 76.9% accuracy on Kinetics-400
Video Retrieval
Retrieve relevant video clips based on text queries
Achieves 51.2% accuracy on VTT-T2I
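
Text-to-image and text-to-video retrieval are commonly scored by ranking every candidate against each text query and checking whether the correct item appears among the top results. The sketch below shows one such measure, recall@k, on made-up two-dimensional embeddings. It is an assumption that this resembles how the retrieval figures above were computed; the page does not say which metric or protocol was used, and all function names here are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_candidates(query_emb, candidate_embs):
    """Return candidate indices sorted from most to least similar."""
    scores = [cosine(query_emb, cand) for cand in candidate_embs]
    return sorted(range(len(candidate_embs)), key=lambda i: scores[i], reverse=True)


def recall_at_k(query_embs, candidate_embs, ground_truth, k=1):
    """Fraction of queries whose correct candidate is ranked in the top k."""
    hits = 0
    for query_emb, target in zip(query_embs, ground_truth):
        if target in rank_candidates(query_emb, candidate_embs)[:k]:
            hits += 1
    return hits / len(query_embs)


# Placeholder embeddings (assumed values, not real encoder outputs).
candidates = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]   # image or video clip embeddings
queries = [[0.9, 0.1], [0.2, 0.8], [0.9, 0.2]]      # text query embeddings
truth = [0, 1, 2]                                   # correct candidate per query

r1 = recall_at_k(queries, candidates, truth, k=1)
r2 = recall_at_k(queries, candidates, truth, k=2)
print(r1, r2)
```

Here the third query ranks its correct candidate second, so it counts as a miss at k=1 and a hit at k=2.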