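🚀 VinVL Model Card for Image Captioning 🖼️
Microsoft's VinVL base model, pretrained for the downstream task of image captioning. Given region features and object labels extracted from an image, it generates a natural-language caption.
🚀 Quick Start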
import torch
from transformers.pytorch_transformers import BertConfig, BertTokenizer
from oscar.modeling.modeling_bert import BertForImageCaptioning
from oscar.wrappers import OscarTensorizer

ckpt = "path/to/the/checkpoint"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the configuration, tokenizer, and captioning model from the checkpoint.
config = BertConfig.from_pretrained(ckpt)
tokenizer = BertTokenizer.from_pretrained(ckpt)
model = BertForImageCaptioning.from_pretrained(ckpt, config=config).to(device)

# The tensorizer handles input preprocessing and output decoding.
tensorizer = OscarTensorizer(tokenizer=tokenizer, device=device)

# feat_obj is not defined here: it must be a NumPy array of region features
# produced by the visual backbone. unsqueeze(0) adds the batch dimension.
visual_features = torch.from_numpy(feat_obj).to(device).unsqueeze(0)

# Object labels for the same image, paired with the region features.
labels = [['boat', 'boat', 'boat', 'bottom', 'bush', 'coat', 'deck', 'deck', 'deck', 'dock', 'hair', 'jacket']]

inputs = tensorizer.encode(visual_features, labels=labels)
outputs = model(**inputs)
pred = tensorizer.decode(outputs)
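The snippet above uses `feat_obj` without defining it. One possible way to supply it, assuming the region features were precomputed and saved as a `.npy` file, is sketched below. The helper `load_region_features`, the file name `demo_features.npy`, and the demo shape `(12, 2054)` are illustrative assumptions and are not part of the VinVL or Oscar code; random data stands in for real backbone output.

```python
import os
import tempfile

import numpy as np


def load_region_features(path):
    """Load precomputed region features and check they are 2-D.

    Hypothetical helper: expects an array of shape (num_boxes, feat_size).
    """
    feats = np.load(path)
    if feats.ndim != 2:
        raise ValueError(f"expected a 2-D array, got shape {feats.shape}")
    # torch.from_numpy keeps the NumPy dtype, so cast to float32 up front.
    return feats.astype(np.float32, copy=False)


# Demo: random values stand in for real features (assumed shape, not real data).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_features.npy")
    np.save(demo_path, np.random.default_rng(0).standard_normal((12, 2054)))
    feat_obj = load_region_features(demo_path)

# Adding a leading axis mirrors what .unsqueeze(0) does on the tensor.
batched = feat_obj[np.newaxis, ...]
print(feat_obj.shape, batched.shape)
```

An array loaded this way can be passed to `torch.from_numpy(feat_obj)` as in the Quick Start code.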
📦 Installation
For more information on how to install and use this model, see michelecafagna26/VinVL.
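📚 Documentation
COCO Test Set Metrics 📈
Table from the authors (Table 7, cross-entropy optimization):
| Bleu-4 | METEOR | CIDEr | SPICE |
|--------|--------|-------|-------|
| 0.38   | 0.30   | 1.29  | 0.23  |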
Feature Extraction ⛏️
This model relies on a separate visual backbone for feature extraction.
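As a rough illustration of what the backbone output might look like once prepared for the captioning model, the sketch below appends six normalized box-geometry values to each region's feature vector. The layout (2048 appearance features plus 6 geometry values, giving 2054 per region) is an assumption based on the commonly described VinVL default and is not confirmed by this card. The function `build_region_features` and all demo values are hypothetical.

```python
import numpy as np


def build_region_features(region_feats, boxes, image_width, image_height):
    """Append 6 normalized box-geometry values to each region feature vector.

    region_feats: (num_boxes, feat_dim) array from the visual backbone
    boxes:        (num_boxes, 4) array of [x1, y1, x2, y2] in pixels
    Assumed layout, not taken from the VinVL code base.
    """
    region_feats = np.asarray(region_feats, dtype=np.float32)
    boxes = np.asarray(boxes, dtype=np.float32)
    if region_feats.shape[0] != boxes.shape[0]:
        raise ValueError("region_feats and boxes must describe the same regions")

    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    # Corner coordinates plus width and height, each scaled by the image size.
    geometry = np.stack(
        [
            x1 / image_width,
            y1 / image_height,
            x2 / image_width,
            y2 / image_height,
            (x2 - x1) / image_width,
            (y2 - y1) / image_height,
        ],
        axis=1,
    )
    return np.concatenate([region_feats, geometry], axis=1)


# Demo with random appearance features and one repeated box (placeholder data).
rng = np.random.default_rng(0)
demo_feats = rng.standard_normal((12, 2048)).astype(np.float32)
demo_boxes = np.tile(np.array([10.0, 20.0, 110.0, 220.0], dtype=np.float32), (12, 1))
feat_obj = build_region_features(demo_feats, demo_boxes, image_width=640, image_height=480)
print(feat_obj.shape)  # (12, 2054)
```

In practice the features, boxes, and object labels would all come from the actual visual backbone rather than from random data.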
More information on the following:
Dataset
| Property | Details |
|----------|---------|
| Training data | coco, conceptual-caption, sbu, flickr30k, vqa, gqa, vg-qa, open-images |
Library Information
| Property | Details |
|----------|---------|
| Library name | pytorch |
| Tags | pytorch, image-to-text |
📄 License
This model is released under the Apache-2.0 license.
🧾 Citation
Please consider citing the original project and the VinVL paper:
@misc{han2021image,
title={Image Scene Graph Generation (SGG) Benchmark},
author={Xiaotian Han and Jianwei Yang and Houdong Hu and Lei Zhang and Jianfeng Gao and Pengchuan Zhang},
year={2021},
eprint={2107.12604},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@inproceedings{zhang2021vinvl,
title={Vinvl: Revisiting visual representations in vision-language models},
author={Zhang, Pengchuan and Li, Xiujun and Hu, Xiaowei and Yang, Jianwei and Zhang, Lei and Wang, Lijuan and Choi, Yejin and Gao, Jianfeng},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={5579--5588},
year={2021}
}