# Zero-shot Visual Understanding

PE Core B16 224
Apache-2.0
The Perception Encoder is a state-of-the-art image and video understanding encoder trained through simple vision-language learning, achieving top performance across various visual tasks.
Text-to-Image
facebook
9,663
11
PE Core G14 448
Apache-2.0
The Perception Encoder (PE) is a state-of-the-art image and video understanding encoder trained through simple vision-language learning, achieving top performance across various visual tasks.
Text-to-Image
facebook
22.83k
14
PE Core L14 336
Apache-2.0
A large-scale visual encoder developed by Meta that achieves state-of-the-art performance on various vision tasks through contrastive pre-training and fine-tuning on synthetic video data.
Text-to-Image
facebook
11.52k
34
AKI 4B Phi 3.5 Mini
AKI is a multimodal foundation model that achieves cross-modal mutual attention (MMA) by unlocking the causal attention mechanism in LLMs, addressing vision-language misalignment without additional parameters or training time.
Image-to-Text English
Sony
25
27
Florence 2 VLM Doc VQA
A Visual Question Answering (VQA) model fine-tuned from microsoft/Florence-2-base-ft, capable of interpreting image content and answering questions about it.
Text-to-Image Transformers English
prithivMLmods
69
4
Instructblip Vicuna 13b
Other
InstructBLIP is the visual instruction-tuned version of BLIP-2, based on the Vicuna-13b language model, designed for vision-language tasks.
Image-to-Text Transformers English
Salesforce
1,251
42
Blip2zh Chatglm 6b
A Chinese multimodal chat model built on BLIP2, with basic image understanding capabilities and text dialogue performance consistent with ChatGLM.
Text-to-Image Transformers Multilingual
Xipotzzz
22
22