# PromptCap: Prompt-Guided Image Captioning
This repository accompanies the paper PromptCap: Prompt-Guided Task-Aware Image Captioning, accepted at ICCV 2023 as [PromptCap: Prompt-Guided Image Captioning for VQA with GPT-3](https://openaccess.thecvf.com/content/ICCV2023/html/Hu_PromptCap_Prompt-Guided_Image_Captioning_for_VQA_with_GPT-3_ICCV_2023_paper.html). PromptCap is a captioning model that can be controlled by natural-language instructions and achieves strong performance on image captioning and visual question answering tasks.
| Property | Details |
| --- | --- |
| License | OpenRAIL |
| Inference | False |
| Pipeline Tag | Image-to-text |
| Tags | Image-to-text, Visual-question-answering, Image-captioning |
| Datasets | COCO, TextVQA, VQAv2, OK-VQA, A-OKVQA |
| Language | English |
## Features
- Natural Language Control: PromptCap can be guided by natural-language instructions, which may include questions the user is interested in.
- Versatile Applications: It can serve as a lightweight visual plug-in for large language models such as GPT-3 and ChatGPT, and for other foundation models such as Segment Anything and DINO.
- SOTA Performance: Achieves state-of-the-art performance on COCO captioning (150 CIDEr) and on knowledge-based VQA tasks when paired with GPT-3.
## Quick Start

### Installation

```bash
pip install promptcap
```
The installation includes two pipelines: one for image captioning and the other for visual question answering.
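Both pipelines are exposed as top-level classes, as used in the examples below:

```python
from promptcap import PromptCap      # image captioning pipeline
from promptcap import PromptCap_VQA  # visual question answering pipeline
```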
## Usage Examples

### Captioning Pipeline
Please follow the prompt format shown below for optimal performance.
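The question-conditioned prompt follows a fixed template, taken verbatim from the examples in this README. A minimal helper illustrating it (the `build_prompt` name is ours, not part of the package):

```python
def build_prompt(question: str) -> str:
    # Template used throughout the question-conditioned examples below.
    return f"please describe this image according to the given question: {question}"
```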
#### Basic Usage
```python
import torch
from promptcap import PromptCap

# Load the pretrained PromptCap checkpoint from the Hugging Face Hub.
model = PromptCap("tifa-benchmark/promptcap-coco-vqa")

# Move the model to GPU if one is available.
if torch.cuda.is_available():
    model.cuda()

prompt = "please describe this image according to the given question: what piece of clothing is this boy putting on?"
image = "glove_boy.jpeg"

print(model.caption(prompt, image))
```
To try generic captioning:

```python
prompt = "what does the image describe?"
image = "glove_boy.jpeg"

print(model.caption(prompt, image))
```
#### Advanced Usage
PromptCap also accepts OCR text as an additional input:

```python
prompt = "please describe this image according to the given question: what year was this taken?"
image = "dvds.jpg"
ocr = "yip AE Mht juor 02/14/2012"  # OCR text extracted from the image

print(model.caption(prompt, image, ocr))
```
### Visual Question Answering Pipeline
Unlike typical VQA models, PromptCap is open-domain and can be paired with an arbitrary text QA model. Here is an example that combines it with UnifiedQA.
#### Basic Usage
```python
import torch
from promptcap import PromptCap_VQA

# Pair the PromptCap captioner with UnifiedQA as the text QA model.
vqa_model = PromptCap_VQA(
    promptcap_model="tifa-benchmark/promptcap-coco-vqa",
    qa_model="allenai/unifiedqa-t5-base",
)

if torch.cuda.is_available():
    vqa_model.cuda()

question = "what piece of clothing is this boy putting on?"
image = "glove_boy.jpeg"

print(vqa_model.vqa(question, image))
```
#### Advanced Usage
The VQA pipeline accepts OCR input in the same way:

```python
question = "what year was this taken?"
image = "dvds.jpg"
ocr = "yip AE Mht juor 02/14/2012"

print(vqa_model.vqa(question, image, ocr=ocr))
```
Multiple-choice VQA is also supported:
```python
question = "what piece of clothing is this boy putting on?"
image = "glove_boy.jpeg"
choices = ["gloves", "socks", "shoes", "coats"]

print(vqa_model.vqa_multiple_choice(question, image, choices))
```
## License
The project is released under the OpenRAIL license.
## Documentation
For more details, please refer to the paper [PromptCap: Prompt-Guided Task-Aware Image Captioning](https://arxiv.org/abs/2211.09699).
## Technical Details
PromptCap achieves SOTA performance on COCO captioning (150 CIDEr). When paired with GPT-3 and conditioned on user questions, it achieves SOTA performance on knowledge-based VQA tasks (60.4% on OK-VQA and 59.6% on A-OKVQA).
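A rough sketch of this plug-in pattern follows. The LLM client, model name, and prompt wording are our illustrative assumptions, not the paper's exact setup (the paper pairs PromptCap with GPT-3 using in-context examples):

```python
import torch
from openai import OpenAI  # assumes the openai-python >= 1.0 client
from promptcap import PromptCap

model = PromptCap("tifa-benchmark/promptcap-coco-vqa")
if torch.cuda.is_available():
    model.cuda()

question = "what piece of clothing is this boy putting on?"
image = "glove_boy.jpeg"

# Step 1: generate a question-aware caption with PromptCap.
caption = model.caption(
    f"please describe this image according to the given question: {question}", image
)

# Step 2: hand the caption to a text-only LLM to answer the question.
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",  # hypothetical stand-in for the GPT-3 models used in the paper
    messages=[{
        "role": "user",
        "content": f"Context: {caption}\nQuestion: {question}\nAnswer briefly.",
    }],
)
print(response.choices[0].message.content)
```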
## BibTeX
```bibtex
@article{hu2022promptcap,
  title={PromptCap: Prompt-Guided Task-Aware Image Captioning},
  author={Hu, Yushi and Hua, Hang and Yang, Zhengyuan and Shi, Weijia and Smith, Noah A and Luo, Jiebo},
  journal={arXiv preprint arXiv:2211.09699},
  year={2022}
}
```