Microsoft's open-source VinVL image captioning model - free to deploy for accurate image description generation

Vinvl Base Image Captioning

Developed by michelecafagna26

Microsoft's VinVL base pretrained model, designed for image captioning, with strong vision-language understanding.

Image-to-Text

PyTorch

License: Apache-2.0 #multimodal image captioning #high-precision visual features #scene graph generation

Downloads: 45

Released: 12/23/2022

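Model overview

VinVL is a vision-language pretrained model used mainly to generate natural-language descriptions from images. It combines visual feature extraction with language generation, so it can interpret image content and produce an accurate caption.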

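Model features

Strong visual feature extraction

Uses a separate visual backbone network to extract image features

Pretraining on multiple datasets

Pretrained on several vision-language datasets, including COCO and Conceptual Captions

High-performance image captioning

Reaches a CIDEr score of 1.29 on the COCO test set with cross-entropy optimization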

Model capabilities

Image understanding

Natural language generation

Vision-language alignment

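Use cases

Content generation

Automatic image annotation

Automatically generates descriptive text for the images in a photo library

Produces accurate, fluent image descriptions

Assistive technology

Visual assistance

Provides descriptions of image content for visually impaired users

Helps users understand visual content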

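🚀 VinVL model card for image captioning 🖼️

Microsoft's VinVL base pretrained model, designed for the image captioning downstream task. The model takes image region features and object labels as input and generates a text description of the image.

🚀 Quick start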

import torch
from transformers.pytorch_transformers import BertConfig, BertTokenizer
from oscar.modeling.modeling_bert import BertForImageCaptioning
from oscar.wrappers import OscarTensorizer

ckpt = "path/to/the/checkpoint"
device = "cuda" if torch.cuda.is_available() else "cpu"

# original code: load the config, tokenizer, and captioning model from the checkpoint
config = BertConfig.from_pretrained(ckpt)
tokenizer = BertTokenizer.from_pretrained(ckpt)
model = BertForImageCaptioning.from_pretrained(ckpt, config=config).to(device)

# the tensorizer takes care of the preprocessing
tensorizer = OscarTensorizer(tokenizer=tokenizer, device=device)

# feat_obj is not defined here: it is the NumPy array of region features produced
# by the visual backbone, expected to have shape (num_boxes, feat_size).
# feat_size is 2054 by default in VinVL.
# After unsqueeze(0), visual_features has shape (1, num_boxes, feat_size).
visual_features = torch.from_numpy(feat_obj).to(device).unsqueeze(0)

# labels are usually produced by the feature extractor, one list per image
labels = [['boat', 'boat', 'boat', 'bottom', 'bush', 'coat', 'deck', 'deck', 'deck', 'dock', 'hair', 'jacket']]

inputs = tensorizer.encode(visual_features, labels=labels)
outputs = model(**inputs)

pred = tensorizer.decode(outputs)

# the output looks like this
# pred = {0: [{'caption': 'a red and white boat traveling down a river next to a small boat.', 'conf': 0.7070220112800598}]}
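
The decoded `pred` shown above maps a batch index to a list of caption candidates, each carrying a `caption` string and a `conf` score. The following is a minimal sketch of picking the highest-confidence caption per image, assuming every entry follows that structure. `best_captions` is a hypothetical helper written for illustration; it is not part of the VinVL or Oscar code.

```python
def best_captions(pred):
    """Return {image_index: (caption, confidence)} for the top-scoring candidate.

    Assumes pred looks like {0: [{'caption': str, 'conf': float}, ...], ...},
    as in the sample output above. Images with no candidates are skipped.
    """
    best = {}
    for idx, candidates in pred.items():
        if not candidates:
            continue
        top = max(candidates, key=lambda c: c["conf"])
        best[idx] = (top["caption"], float(top["conf"]))
    return best


# Sample value copied from the quick-start output
pred = {
    0: [
        {
            "caption": "a red and white boat traveling down a river next to a small boat.",
            "conf": 0.7070220112800598,
        }
    ]
}

for idx, (caption, conf) in best_captions(pred).items():
    print(f"image {idx}: {caption} (conf={conf:.3f})")
```

Converting `conf` with `float()` keeps the helper usable whether the score arrives as a Python float or as a single-element tensor.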

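✨ Main features

The model uses a separate visual backbone for feature extraction.

📦 Installation

For more information on how to install and use this model, see: michelecafagna26/VinVL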

📚 Detailed documentation

COCO test set metrics 📈

Table from the authors (Table 7, cross-entropy optimization)

Bleu-4	METEOR	CIDEr	SPICE
0.38	0.30	1.29	0.23

Feature extraction ⛏️

This model uses a separate visual backbone for feature extraction.

More information:

Model: michelecafagna26/vinvl_vg_x152c4
Usage and installation: michelecafagna26/vinvl-visualbackbone
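
The captioning model expects one feature vector of size 2054 per detected region. A common convention in Oscar-style pipelines, taken here as an assumption rather than something this card states, is 2048 region-feature values followed by 6 normalized box-geometry values. The sketch below shows how such an array could be assembled with NumPy. `build_visual_features`, the random features, and the dummy boxes are all hypothetical stand-ins for the real backbone output.

```python
import numpy as np


def build_visual_features(region_feats, boxes, img_w, img_h):
    """Concatenate region features with 6 normalized box-geometry values.

    region_feats: (num_boxes, 2048) array of per-region features.
    boxes: (num_boxes, 4) array of pixel boxes as (x1, y1, x2, y2).
    Returns a float32 array of shape (num_boxes, 2054).
    The 2048 + 6 layout is an assumption, not taken from this model card.
    """
    region_feats = np.asarray(region_feats, dtype=np.float32)
    boxes = np.asarray(boxes, dtype=np.float32)
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    spatial = np.stack(
        [
            x1 / img_w,
            y1 / img_h,
            x2 / img_w,
            y2 / img_h,
            (x2 - x1) / img_w,  # normalized box width
            (y2 - y1) / img_h,  # normalized box height
        ],
        axis=1,
    )
    return np.concatenate([region_feats, spatial], axis=1)


# Dummy inputs standing in for real detector output
rng = np.random.default_rng(0)
num_boxes = 12
region_feats = rng.standard_normal((num_boxes, 2048))
boxes = np.tile(np.array([10.0, 20.0, 110.0, 220.0]), (num_boxes, 1))

feat_obj = build_visual_features(region_feats, boxes, img_w=640, img_h=480)
print(feat_obj.shape)
```

An array built this way has the `(num_boxes, feat_size)` shape that the quick-start snippet passes to `torch.from_numpy` as `feat_obj`. For real inputs, use the features returned by the visual backbone listed above instead of random values.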

Datasets

Property	Details
Training data	coco, conceptual-caption, sbu, flickr30k, vqa, gqa, vg-qa, open-images

Library information

Property	Details
Library name	pytorch
Tags	pytorch, image-to-text

📄 License

This model is released under the Apache-2.0 license.

🧾 Citation

Please consider citing the original project and the VinVL paper:


@misc{han2021image,
      title={Image Scene Graph Generation (SGG) Benchmark}, 
      author={Xiaotian Han and Jianwei Yang and Houdong Hu and Lei Zhang and Jianfeng Gao and Pengchuan Zhang},
      year={2021},
      eprint={2107.12604},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

@inproceedings{zhang2021vinvl,
  title={Vinvl: Revisiting visual representations in vision-language models},
  author={Zhang, Pengchuan and Li, Xiujun and Hu, Xiaowei and Yang, Jianwei and Zhang, Lei and Wang, Lijuan and Choi, Yejin and Gao, Jianfeng},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={5579--5588},
  year={2021}
}