# PromptCap: Prompt-Guided Image Captioning
This repository accompanies the paper PromptCap: Prompt-Guided Task-Aware Image Captioning, accepted at ICCV 2023 as [PromptCap: Prompt-Guided Image Captioning for VQA with GPT-3](https://openaccess.thecvf.com/content/ICCV2023/html/Hu_PromptCap_Prompt-Guided_Image_Captioning_for_VQA_with_GPT-3_ICCV_2023_paper.html). PromptCap is a captioning model that can be controlled by natural-language instructions and achieves strong performance on image captioning and visual question answering tasks.
| Property | Details |
| --- | --- |
| License | OpenRAIL |
| Inference | False |
| Pipeline Tag | Image-to-text |
| Tags | Image-to-text, Visual-question-answering, Image-captioning |
| Datasets | COCO, TextVQA, VQAv2, OK-VQA, A-OKVQA |
| Language | English |
## Features
- Natural Language Control: PromptCap can be guided by natural-language instructions, which may include questions the user is interested in.
- Versatile Applications: It can serve as a lightweight visual plug-in for large language models such as GPT-3 and ChatGPT, and for other foundation models such as Segment Anything and DINO.
- SOTA Performance: Achieves state-of-the-art performance on COCO captioning (150 CIDEr) and on knowledge-based VQA tasks when paired with GPT-3.
## Quick Start

### Installation

```bash
pip install promptcap
```
The installation includes two pipelines: one for image captioning and the other for visual question answering.
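Both pipelines are exposed as top-level classes, as used in the examples below:

```python
from promptcap import PromptCap      # image captioning pipeline
from promptcap import PromptCap_VQA  # visual question answering pipeline
```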
## Usage Examples

### Captioning Pipeline
Please follow the prompt format shown below for optimal performance.
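The question-conditioned prompt follows a fixed template, taken verbatim from the examples in this README. A minimal helper illustrating it (the `build_prompt` name is ours, not part of the package):

```python
def build_prompt(question: str) -> str:
    # Template used throughout the question-conditioned examples below.
    return f"please describe this image according to the given question: {question}"
```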
#### Basic Usage
```python
import torch
from promptcap import PromptCap

# Load the pretrained PromptCap checkpoint from the Hugging Face Hub.
model = PromptCap("tifa-benchmark/promptcap-coco-vqa")

# Move the model to GPU if one is available.
if torch.cuda.is_available():
    model.cuda()

prompt = "please describe this image according to the given question: what piece of clothing is this boy putting on?"
image = "glove_boy.jpeg"

print(model.caption(prompt, image))
```
To try generic captioning:

```python
prompt = "what does the image describe?"
image = "glove_boy.jpeg"

print(model.caption(prompt, image))
```
#### Advanced Usage
PromptCap also accepts OCR text as an additional input:

```python
prompt = "please describe this image according to the given question: what year was this taken?"
image = "dvds.jpg"
ocr = "yip AE Mht juor 02/14/2012"  # OCR text extracted from the image

print(model.caption(prompt, image, ocr))
```
### Visual Question Answering Pipeline
Unlike typical VQA models, PromptCap is open-domain and can be paired with an arbitrary text QA model. Here is an example that combines it with UnifiedQA.
#### Basic Usage
```python
import torch
from promptcap import PromptCap_VQA

# Pair the PromptCap captioner with UnifiedQA as the text QA model.
vqa_model = PromptCap_VQA(
    promptcap_model="tifa-benchmark/promptcap-coco-vqa",
    qa_model="allenai/unifiedqa-t5-base",
)

if torch.cuda.is_available():
    vqa_model.cuda()

question = "what piece of clothing is this boy putting on?"
image = "glove_boy.jpeg"

print(vqa_model.vqa(question, image))
```
#### Advanced Usage
The VQA pipeline accepts OCR input in the same way:

```python
question = "what year was this taken?"
image = "dvds.jpg"
ocr = "yip AE Mht juor 02/14/2012"

print(vqa_model.vqa(question, image, ocr=ocr))
```
Multiple-choice VQA is also supported:
```python
question = "what piece of clothing is this boy putting on?"
image = "glove_boy.jpeg"
choices = ["gloves", "socks", "shoes", "coats"]

print(vqa_model.vqa_multiple_choice(question, image, choices))
```
## License
The project is released under the OpenRAIL license.
## Documentation
For more details, please refer to the paper [PromptCap: Prompt-Guided Task-Aware Image Captioning](https://arxiv.org/abs/2211.09699).
## Technical Details
PromptCap achieves SOTA performance on COCO captioning (150 CIDEr). When paired with GPT-3 and conditioned on user questions, it achieves SOTA performance on knowledge-based VQA tasks (60.4% on OK-VQA and 59.6% on A-OKVQA).
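A rough sketch of this plug-in pattern follows. The LLM client, model name, and prompt wording are our illustrative assumptions, not the paper's exact setup (the paper pairs PromptCap with GPT-3 using in-context examples):

```python
import torch
from openai import OpenAI  # assumes the openai-python >= 1.0 client
from promptcap import PromptCap

model = PromptCap("tifa-benchmark/promptcap-coco-vqa")
if torch.cuda.is_available():
    model.cuda()

question = "what piece of clothing is this boy putting on?"
image = "glove_boy.jpeg"

# Step 1: generate a question-aware caption with PromptCap.
caption = model.caption(
    f"please describe this image according to the given question: {question}", image
)

# Step 2: hand the caption to a text-only LLM to answer the question.
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",  # hypothetical stand-in for the GPT-3 models used in the paper
    messages=[{
        "role": "user",
        "content": f"Context: {caption}\nQuestion: {question}\nAnswer briefly.",
    }],
)
print(response.choices[0].message.content)
```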
## BibTeX
```bibtex
@article{hu2022promptcap,
  title={PromptCap: Prompt-Guided Task-Aware Image Captioning},
  author={Hu, Yushi and Hua, Hang and Yang, Zhengyuan and Shi, Weijia and Smith, Noah A and Luo, Jiebo},
  journal={arXiv preprint arXiv:2211.09699},
  year={2022}
}
```