Microsoft's Open-Source VinVL Image Captioning Model: Freely Deployable, Accurate Image Caption Generation

Vinvl Base Image Captioning

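Developed by michelecafagna26

Microsoft's VinVL base pretrained model, designed for image captioning, with strong vision-language understanding.

Image-to-Text

PyTorch

License: Apache-2.0 #Multimodal image captioning #High-precision visual features #Scene graph generation

Downloads: 45

Published: 12/23/2022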

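Model Overview

VinVL is a vision-language pretrained model used mainly to generate natural-language descriptions from images. It combines visual feature extraction with language generation, so it can interpret image content and produce an accurate caption.

Model Features

Strong visual feature extraction

Uses a separate visual backbone network to extract image features

Pretraining on multiple datasets

Pretrained on several vision-language datasets, including COCO and Conceptual Captions

High-performance image captioning

Reaches CIDEr 1.29 and BLEU-4 0.38 on the COCO test set with cross-entropy optimization

Model Capabilities

Image understanding

Natural language generation

Vision-language alignment

Use Cases

Content generation

Automatic image annotation

Automatically generates descriptive text for the images in a photo library

Produces accurate, fluent image captions

Assistive technology

Visual assistance

Describes image content for people with visual impairments

Helps users understand visual content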

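🚀 Model Card for VinVL Image Captioning 🖼️

Microsoft's VinVL base pretrained model, fine-tuned for the image captioning downstream task. Given region features and object tags extracted from an image, it generates a text caption for that image.

🚀 Quick Start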

import torch
from transformers.pytorch_transformers import BertConfig, BertTokenizer
from oscar.modeling.modeling_bert import BertForImageCaptioning
from oscar.wrappers import OscarTensorizer

ckpt = "path/to/the/checkpoint"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Original Oscar code: load the config, tokenizer, and captioning model
config = BertConfig.from_pretrained(ckpt)
tokenizer = BertTokenizer.from_pretrained(ckpt)
model = BertForImageCaptioning.from_pretrained(ckpt, config=config).to(device)

# The tensorizer takes care of the preprocessing
tensorizer = OscarTensorizer(tokenizer=tokenizer, device=device)

# feat_obj is not defined in this snippet: it is the NumPy array of region
# features produced by the visual backbone. unsqueeze(0) adds the batch axis,
# so visual_features has shape (1, num_boxes, feat_size).
# In VinVL, feat_size is 2054 by default.
visual_features = torch.from_numpy(feat_obj).to(device).unsqueeze(0)

# The labels (object tags) are usually produced by the feature extractor
labels = [['boat', 'boat', 'boat', 'bottom', 'bush', 'coat', 'deck', 'deck', 'deck', 'dock', 'hair', 'jacket']]

inputs = tensorizer.encode(visual_features, labels=labels)
outputs = model(**inputs)

pred = tensorizer.decode(outputs)

# The output looks like this:
# pred = {0: [{'caption': 'a red and white boat traveling down a river next to a small boat.', 'conf': 0.7070220112800598}]}
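
The decoded prediction is a dictionary keyed by image index, where each value is a list of candidate captions with a confidence score. The following is a minimal sketch of how such a structure could be reduced to one caption per image. The helper name best_captions is hypothetical and not part of the oscar package, and the sketch assumes every candidate has the 'caption' and 'conf' keys shown in the sample output.

```python
from typing import Dict, List, Tuple


def best_captions(pred: Dict[int, List[dict]]) -> Dict[int, Tuple[str, float]]:
    """Return the highest-confidence (caption, confidence) pair per image index.

    Hypothetical helper: assumes each candidate is a dict with 'caption' and
    'conf' keys, matching the sample output of tensorizer.decode.
    """
    best = {}
    for idx, candidates in pred.items():
        if not candidates:
            continue  # no caption was generated for this image
        top = max(candidates, key=lambda c: c["conf"])
        best[idx] = (top["caption"], float(top["conf"]))
    return best


# Sample prediction copied from the output shown in the quick start
pred = {
    0: [
        {
            "caption": "a red and white boat traveling down a river next to a small boat.",
            "conf": 0.7070220112800598,
        }
    ]
}

for idx, (caption, conf) in best_captions(pred).items():
    print(f"image {idx}: {caption} (conf={conf:.2f})")
```

Images with an empty candidate list are skipped rather than raising an error, which keeps batch post-processing simple.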

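✨ Key Features

The model uses a separate visual backbone for feature extraction.

📦 Installation

For more information on how to install and use this model, see: michelecafagna26/VinVL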

📚 Documentation

COCO Test Set Metrics 📈

Table from the authors (Table 7, cross-entropy optimization)

BLEU-4	METEOR	CIDEr	SPICE
0.38	0.30	1.29	0.23

Feature Extraction ⛏️

This model uses a separate visual backbone for feature extraction.

More information:

Model: michelecafagna26/vinvl_vg_x152c4
Usage and installation: michelecafagna26/vinvl-visualbackbone
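
To make the default feat_size of 2054 more concrete, here is a sketch of one way such a vector can be assembled. It assumes the 2054 dimensions are a 2048-dimensional pooled region feature followed by six normalized box-geometry values (corner coordinates, width, and height, each divided by the image size). This layout is an assumption and is not stated in this card, so check the visual backbone repository for the exact format. The function box_geometry and the constant REGION_DIM are hypothetical names.

```python
from typing import List, Sequence

REGION_DIM = 2048  # assumed size of the pooled region feature from the backbone


def box_geometry(box: Sequence[float], image_width: float, image_height: float) -> List[float]:
    """Six normalized geometry values for one box given as (x1, y1, x2, y2) in pixels.

    Assumed layout: [x1/W, y1/H, x2/W, y2/H, box_width/W, box_height/H].
    """
    if image_width <= 0 or image_height <= 0:
        raise ValueError("image dimensions must be positive")
    x1, y1, x2, y2 = box
    return [
        x1 / image_width,
        y1 / image_height,
        x2 / image_width,
        y2 / image_height,
        (x2 - x1) / image_width,
        (y2 - y1) / image_height,
    ]


# Placeholder region feature; a real one would come from the visual backbone
region_feature = [0.0] * REGION_DIM

# Append the geometry values to obtain one row of the (num_boxes, 2054) array
full_feature = region_feature + box_geometry((100, 50, 300, 250), image_width=400, image_height=500)
print(len(full_feature))
```

Under this assumption, stacking one such row per detected box yields the per-image array that is then given a batch axis before encoding.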

Datasets

Property	Details
Training data	coco, conceptual-caption, sbu, flickr30k, vqa, gqa, vg-qa, open-images

Library Information

Property	Details
Library name	pytorch
Tags	pytorch, image-to-text

📄 License

This model is released under the Apache-2.0 license.

🧾 Citation

Please consider citing the original project and the VinVL paper.


@misc{han2021image,
      title={Image Scene Graph Generation (SGG) Benchmark}, 
      author={Xiaotian Han and Jianwei Yang and Houdong Hu and Lei Zhang and Jianfeng Gao and Pengchuan Zhang},
      year={2021},
      eprint={2107.12604},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

@inproceedings{zhang2021vinvl,
  title={Vinvl: Revisiting visual representations in vision-language models},
  author={Zhang, Pengchuan and Li, Xiujun and Hu, Xiaowei and Yang, Jianwei and Zhang, Lei and Wang, Lijuan and Choi, Yejin and Gao, Jianfeng},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={5579--5588},
  year={2021}
}