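🚀 VinVL Model Card for Image Captioning 🖼️
Microsoft's VinVL base pretrained model, intended for the downstream task of image captioning. The model takes the visual features of an image as input and generates a text caption describing it.
🚀 Quick Start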
import torch
from transformers.pytorch_transformers import BertConfig, BertTokenizer
from oscar.modeling.modeling_bert import BertForImageCaptioning
from oscar.wrappers import OscarTensorizer

ckpt = "path/to/the/checkpoint"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the config, tokenizer, and captioning model from the checkpoint
config = BertConfig.from_pretrained(ckpt)
tokenizer = BertTokenizer.from_pretrained(ckpt)
model = BertForImageCaptioning.from_pretrained(ckpt, config=config).to(device)

# The tensorizer encodes the model inputs and decodes the model outputs
tensorizer = OscarTensorizer(tokenizer=tokenizer, device=device)

# feat_obj is a NumPy array of region features from the visual backbone
# (not defined in this snippet); unsqueeze(0) adds the batch dimension
visual_features = torch.from_numpy(feat_obj).to(device).unsqueeze(0)

# Object labels for the image regions, one list per image in the batch
labels = [['boat', 'boat', 'boat', 'bottom', 'bush', 'coat', 'deck', 'deck', 'deck', 'dock', 'hair', 'jacket']]

inputs = tensorizer.encode(visual_features, labels=labels)
outputs = model(**inputs)
pred = tensorizer.decode(outputs)
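The snippet above uses `feat_obj` without defining it. For a quick smoke test of the pipeline before the visual backbone is set up, a random placeholder array can stand in for real region features. This is only a sketch under assumptions: the helper name `make_placeholder_features` is hypothetical, and the feature width of 2054 is an assumed default for VinVL region features that should be checked against the checkpoint's config.

```python
import numpy as np


def make_placeholder_features(num_boxes: int, feat_size: int = 2054, seed: int = 0) -> np.ndarray:
    """Return a random float32 array shaped (num_boxes, feat_size).

    Hypothetical helper: feat_size=2054 is an assumption, not a value
    taken from this model card. Real features come from the visual backbone.
    """
    if num_boxes <= 0 or feat_size <= 0:
        raise ValueError("num_boxes and feat_size must be positive")
    rng = np.random.default_rng(seed)
    return rng.standard_normal((num_boxes, feat_size)).astype(np.float32)


# One row per region; 12 matches the number of labels in the example above
feat_obj = make_placeholder_features(num_boxes=12)
print(feat_obj.shape)  # (12, 2054)
```

Captions generated from random features are meaningless, so this is useful only for checking that shapes, devices, and imports line up.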
📦 Installation
For more information on installing and using this model, see: michelecafagna26/VinVL
📚 Documentation
COCO Test Set Metrics 📈
Table from the authors (Table 7, cross-entropy optimization)
| Bleu-4 | METEOR | CIDEr | SPICE |
|--------|--------|-------|-------|
| 0.38 | 0.30 | 1.29 | 0.23 |
Feature Extraction ⛏️
This model uses a separate visual backbone for feature extraction.
More information on the following:
Datasets
| Property | Details |
|----------|---------|
| Training data | coco, conceptual-caption, sbu, flickr30k, vqa, gqa, vg-qa, open-images |
Library Information
| Property | Details |
|----------|---------|
| Library name | pytorch |
| Tags | pytorch, image-to-text |
📄 License
This model is released under the Apache-2.0 license.
🧾 Citation
Please consider citing the original project and the VinVL paper:
@misc{han2021image,
title={Image Scene Graph Generation (SGG) Benchmark},
author={Xiaotian Han and Jianwei Yang and Houdong Hu and Lei Zhang and Jianfeng Gao and Pengchuan Zhang},
year={2021},
eprint={2107.12604},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@inproceedings{zhang2021vinvl,
title={Vinvl: Revisiting visual representations in vision-language models},
author={Zhang, Pengchuan and Li, Xiujun and Hu, Xiaowei and Yang, Jianwei and Zhang, Lei and Wang, Lijuan and Choi, Yejin and Gao, Jianfeng},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={5579--5588},
year={2021}
}