Promptcap Coco Vqa

Developed by tifa-benchmark
PromptCap is an image captioning model controllable via natural language instructions, supporting visual question answering and general description generation tasks.
Downloads 121
Release Time: 1/23/2023

Model Overview

PromptCap is a prompt-guided task-aware image captioning model that generates image descriptions based on user-provided natural language instructions, compatible with large language models like GPT-3.

Model Features

Prompt-guided control
Generates descriptions controlled by natural language instructions, supporting both specific question guidance and general description generation.
Lightweight vision plugin
Faster than BLIP-2, suitable for integration with large language models like GPT-3 and ChatGPT.
OCR support
Capable of handling image captioning tasks involving OCR text inputs.
Open-domain question answering
Unlike traditional VQA models, supports open-domain QA when combined with arbitrary text QA models.
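To make the prompt-guided control above concrete, here is a minimal sketch of how the two instruction styles (question-guided and general) might be built as plain strings. The two template strings are assumptions for illustration, not wording confirmed by this page, and `build_prompt` is a hypothetical helper name; check the checkpoint's own documentation for the instructions it was trained on.

```python
from typing import Optional

# Assumed instruction templates (illustrative, not confirmed by this page).
QUESTION_TEMPLATE = "please describe this image according to the given question: {question}"
GENERAL_PROMPT = "what does the image describe?"


def build_prompt(question: Optional[str] = None) -> str:
    """Return a captioning instruction.

    With a question, the instruction asks for a caption focused on that
    question; without one, it asks for a general description.
    """
    if question is None or not question.strip():
        return GENERAL_PROMPT
    return QUESTION_TEMPLATE.format(question=question.strip())


if __name__ == "__main__":
    print(build_prompt("what piece of clothing is this boy putting on?"))
    print(build_prompt())
```

The resulting string would be passed to the captioning model together with the image. Keeping prompt construction separate from the model call makes it easy to switch between the two modes.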

Model Capabilities

Image captioning
Visual question answering
Multimodal understanding
OCR text processing
Open-domain question answering

Use Cases

Visual question answering
Knowledge-based visual QA
Combines with GPT-3 to answer visual questions requiring external knowledge
Achieves state-of-the-art accuracy of 60.4% on OK-VQA and 59.6% on A-OKVQA
Multiple-choice QA
Supports multiple-choice visual question answering based on given options
Image captioning
General image captioning
Generates general descriptions of images
Achieves state-of-the-art performance of 150 CIDEr on the COCO captioning task
Task-aware captioning
Generates focused image descriptions based on specific questions
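The question-answering use cases above share one pattern: caption the image with the question in mind, then pass the caption and question to a text-only QA model. The sketch below shows that flow under stated assumptions. `captioner` and `text_qa` are hypothetical callables standing in for the captioning model and the language model, and the instruction wording and the "Context / Question / Answer" layout are illustrative choices, not a format confirmed by this page.

```python
from typing import Callable, Optional, Sequence


def build_qa_prompt(caption: str, question: str,
                    choices: Optional[Sequence[str]] = None) -> str:
    """Format a text-only QA prompt from a caption and a question.

    If choices are given, they are listed so the text model can pick one
    (the multiple-choice case).
    """
    lines = [f"Context: {caption}", f"Question: {question}"]
    if choices:
        lines.append("Choices: " + ", ".join(choices))
    lines.append("Answer:")
    return "\n".join(lines)


def answer_visual_question(image_path: str,
                           question: str,
                           captioner: Callable[[str, str], str],
                           text_qa: Callable[[str], str],
                           choices: Optional[Sequence[str]] = None) -> str:
    """Caption the image for this question, then ask the text model."""
    # Assumed instruction wording; replace with the checkpoint's documented prompt.
    instruction = f"please describe this image according to the given question: {question}"
    caption = captioner(instruction, image_path)
    return text_qa(build_qa_prompt(caption, question, choices)).strip()


if __name__ == "__main__":
    # Stub callables so the sketch runs without any model or API key.
    fake_captioner = lambda instruction, image: "a boy putting on a baseball glove"
    fake_text_qa = lambda prompt: " glove\n"
    print(answer_visual_question("glove_boy.jpeg",
                                 "what is the boy putting on?",
                                 fake_captioner, fake_text_qa,
                                 choices=["hat", "glove", "shoe"]))
```

Because both models are passed in as plain callables, the same function covers knowledge-based QA (a large language model as `text_qa`) and the multiple-choice case (pass `choices`). Either model can be swapped without touching the pipeline.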