PromptCap: Open-Source Image Captioning Model for Free Visual Question Answering and General Caption Generation

Promptcap Coco Vqa

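Developed by tifa-benchmark

PromptCap is an image captioning model that can be controlled with natural-language instructions. It supports visual question answering and general caption generation.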

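Task: Image-to-Text

Library: Transformers

Language: English
License: OpenRAIL
Tags: #prompt-guided image captioning #multi-task visual question answering #OCR-aware understanding

Downloads: 121

Released: 1/23/2023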

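Model Overview

PromptCap is a prompt-guided, task-aware image captioning model. It generates captions according to a natural-language instruction supplied by the user and can be paired with large language models such as GPT-3.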

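Model Features

Prompt-guided control

Caption generation is controlled with natural-language instructions, covering both question-guided captions and general captions.

Lightweight visual plug-in

Faster than BLIP-2 and suited to pairing with large language models such as GPT-3 and ChatGPT.

OCR support

Accepts OCR text as an additional input when generating captions.

Open-domain question answering

Unlike conventional VQA models, it can be combined with any text QA model for open-domain question answering.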

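Model Capabilities

Image captioning

Visual question answering

Multimodal understanding

OCR text processing

Open-domain question answering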

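Use Cases

Visual question answering

Knowledge-based visual question answering

Combined with GPT-3 to answer visual questions that require external knowledge.

Reported state-of-the-art results: 60.4% on OK-VQA and 59.6% on A-OKVQA.

Multiple-choice question answering

Supports visual question answering over a given set of answer choices.

Image captioning

General image captioning

Generates a general caption for an image.

Reported state-of-the-art performance on COCO captioning: 150 CIDEr.

Task-aware captioning

Generates a caption focused on a specific question.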

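🚀 PromptCap: A Prompt-Guided Image Captioning Model

PromptCap is an image captioning model that can be controlled with natural-language instructions and handles tasks such as image captioning and visual question answering. It can serve as a lightweight visual plug-in for large language models, and it reports strong results on COCO image captioning and knowledge-based visual question answering.

🚀 Quick Start

✨ Key Features

Natural-language instruction control: the model is steered by a natural-language instruction, which may contain a question the user cares about, for example "what piece of clothing is this boy putting on?".
General captioning: for a general caption, use the question "what does the image describe?".
Lightweight visual plug-in: works as a lightweight visual plug-in for large language models such as GPT-3 and ChatGPT, and for foundation models such as Segment Anything and DINO. It is much faster than BLIP-2.
Strong performance: reports state-of-the-art performance on COCO captioning (150 CIDEr). When paired with GPT-3 and conditioned on the user's question, it reports state-of-the-art results on knowledge-based VQA (60.4% on OK-VQA, 59.6% on A-OKVQA).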

📦 Installation

pip install promptcap

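💻 Usage Examples

Basic Usage

The package provides two pipelines: one for image captioning and one for visual question answering.

Captioning Pipeline

For best performance, follow the prompt format shown here. To generate a prompt-guided caption: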

import torch
from promptcap import PromptCap

model = PromptCap("tifa-benchmark/promptcap-coco-vqa")  # OFA checkpoints are also supported, e.g. "OFA-Sys/ofa-large"

if torch.cuda.is_available():
  model.cuda()

prompt = "please describe this image according to the given question: what piece of clothing is this boy putting on?"
image = "glove_boy.jpeg"

print(model.caption(prompt, image))

For a general caption, use the question "what does the image describe?":

prompt = "what does the image describe?"
image = "glove_boy.jpeg"

print(model.caption(prompt, image))
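
The two prompt forms used so far can be produced by one small helper. This is a minimal sketch: `build_prompt`, `QUESTION_PREFIX` and `GENERAL_QUESTION` are hypothetical names that are not part of the `promptcap` package, and the two strings are copied from the examples in this section.

```python
from typing import Optional

# Both strings are copied from the examples in this section.
QUESTION_PREFIX = "please describe this image according to the given question: "
GENERAL_QUESTION = "what does the image describe?"


def build_prompt(question: Optional[str] = None) -> str:
    """Return a PromptCap-style prompt (hypothetical helper, not part of promptcap)."""
    if question is None or not question.strip():
        # No question given: fall back to the general-caption prompt.
        return GENERAL_QUESTION
    return QUESTION_PREFIX + question.strip()


print(build_prompt("what piece of clothing is this boy putting on?"))
print(build_prompt())
```

The returned string can be passed as the first argument of `model.caption`.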

PromptCap also accepts OCR text as input:

prompt = "please describe this image according to the given question: what year was this taken?"
image = "dvds.jpg"
ocr = "yip AE Mht juor 02/14/2012"

print(model.caption(prompt, image, ocr))
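
The `ocr` argument in the example is a single space-separated string. If an OCR engine returns a list of tokens instead, they could be joined as sketched here. `join_ocr_tokens` is a hypothetical helper, and the assumption that PromptCap expects whitespace-separated tokens is inferred only from the example string.

```python
from typing import Iterable


def join_ocr_tokens(tokens: Iterable[str]) -> str:
    """Join OCR tokens into one space-separated string (hypothetical helper)."""
    # Drop empty or whitespace-only tokens and collapse whitespace inside tokens.
    cleaned = [" ".join(tok.split()) for tok in tokens if tok and tok.strip()]
    return " ".join(cleaned)


ocr = join_ocr_tokens(["yip", "AE", "Mht", "juor", "02/14/2012"])
print(ocr)
```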

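Visual Question Answering Pipeline

Typical VQA models are trained as classifiers on VQAv2. PromptCap, by contrast, is open-domain and can be paired with any text QA model. The package provides a pipeline that combines PromptCap with UnifiedQA.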

import torch
from promptcap import PromptCap_VQA

# The QA model can be any UnifiedQA variant, e.g. "allenai/unifiedqa-v2-t5-large-1251000"
vqa_model = PromptCap_VQA(promptcap_model="tifa-benchmark/promptcap-coco-vqa", qa_model="allenai/unifiedqa-t5-base")

if torch.cuda.is_available():
  vqa_model.cuda()

question = "what piece of clothing is this boy putting on?"
image = "glove_boy.jpeg"

print(vqa_model.vqa(question, image))
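
The open-domain design amounts to two stages: caption the image with the question in the prompt, then let a text QA model answer from that caption. The sketch below illustrates the idea with stand-in callables so that it runs without downloading weights. It is not the implementation of `PromptCap_VQA`: the function names are hypothetical, and the lower-cased `question \n caption` input layout is an assumption based on UnifiedQA's published input convention.

```python
from typing import Callable

PROMPT_PREFIX = "please describe this image according to the given question: "


def caption_then_answer(
    captioner: Callable[[str, str], str],
    qa_model: Callable[[str], str],
    question: str,
    image: str,
) -> str:
    """Two-stage VQA sketch: caption the image, then answer from the caption."""
    caption = captioner(PROMPT_PREFIX + question, image)
    # Assumption: UnifiedQA-style input, lower-cased, with question and
    # context separated by a literal backslash-n marker.
    qa_input = f"{question} \\n {caption}".lower()
    return qa_model(qa_input)


# Stand-in models (hypothetical) so the sketch is runnable on its own.
def fake_captioner(prompt: str, image: str) -> str:
    return "A boy putting on a baseball glove."


def fake_qa(text: str) -> str:
    return "glove" if "glove" in text else "unknown"


print(caption_then_answer(
    fake_captioner,
    fake_qa,
    "what piece of clothing is this boy putting on?",
    "glove_boy.jpeg",
))
```

Because the second stage only sees text, any text QA model with a compatible input format could take the place of `fake_qa`.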

As with captioning, the VQA pipeline accepts OCR input:

question = "what year was this taken?"
image = "dvds.jpg"
ocr = "yip AE Mht juor 02/14/2012"

print(vqa_model.vqa(question, image, ocr=ocr))

Thanks to the flexibility of UnifiedQA, PromptCap also supports multiple-choice VQA:

question = "what piece of clothing is this boy putting on?"
image = "glove_boy.jpeg"
choices = ["gloves", "socks", "shoes", "coats"]
print(vqa_model.vqa_multiple_choice(question, image, choices))
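
UnifiedQA-style models read multiple-choice options as lettered items inside the input text. The sketch below shows one way such a string could be built from a `choices` list. `format_choices` is a hypothetical helper, and whether `vqa_multiple_choice` formats its options this way internally is an assumption that is not stated here.

```python
import string
from typing import Sequence


def format_choices(choices: Sequence[str]) -> str:
    """Render options as "(a) x (b) y ..." (assumed UnifiedQA-style layout)."""
    if len(choices) > len(string.ascii_lowercase):
        raise ValueError("at most 26 choices are supported in this sketch")
    return " ".join(
        f"({letter}) {choice}"
        for letter, choice in zip(string.ascii_lowercase, choices)
    )


choices = ["gloves", "socks", "shoes", "coats"]
print(format_choices(choices))  # (a) gloves (b) socks (c) shoes (d) coats
```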

📚 Documentation

This is the code repository for the paper PromptCap: Prompt-Guided Task-Aware Image Captioning. The paper was accepted to ICCV 2023 under the title PromptCap: Prompt-Guided Image Captioning for VQA with GPT-3.

📄 License

This project is released under the OpenRAIL license.

🔍 Model Information

Property	Details
Model type	Image-to-text
Training data	COCO, TextVQA, VQAv2, OK-VQA, A-OKVQA

📖 BibTeX Citation

@article{hu2022promptcap,
  title={PromptCap: Prompt-Guided Task-Aware Image Captioning},
  author={Hu, Yushi and Hua, Hang and Yang, Zhengyuan and Shi, Weijia and Smith, Noah A and Luo, Jiebo},
  journal={arXiv preprint arXiv:2211.09699},
  year={2022}
}