Microsoft Open-Sources VinVL Image Captioning Model - Free Deployment for Precise Image Content Caption Generation

Vinvl Base Image Captioning

Developed by michelecafagna26

Microsoft's VinVL base model, pre-trained for the image captioning downstream task, with strong vision-language understanding capabilities.

Image-to-Text

PyTorch

Open Source License: Apache-2.0

#Multimodal Image Captioning #High-precision Visual Features #Scene Graph Generation

Downloads 45

Release Time: 12/23/2022

Model Overview

VinVL is a vision-language pre-trained model primarily used for generating natural language descriptions from images. It combines visual feature extraction and language generation capabilities to understand image content and produce accurate descriptive text.

Model Features

Powerful Visual Feature Extraction

Uses a separate visual backbone network to extract image features before captioning.

Multi-dataset Pre-training

Pre-trained on multiple vision-language datasets including COCO and Conceptual Captions.

High-performance Image Captioning

Reports strong image captioning results on the COCO test set (BLEU-4 0.38, METEOR 0.30, CIDEr 1.29, SPICE 0.23 with cross-entropy optimization).

Model Capabilities

Image Understanding

Natural Language Generation

Vision-Language Alignment

Use Cases

Content Generation

Automatic Image Tagging

Automatically generates descriptive text for images in galleries.

Produces accurate and fluent image captions.

Assistive Technology

Visual Assistance

Provides image content descriptions for visually impaired individuals.

Helps in understanding visual content.

🚀 Model Card: VinVL for Captioning 🖼️

This is Microsoft's VinVL base model, pretrained for the downstream task of image caption generation.

📦 Installation

More info about how to install and use this model can be found here: michelecafagna26/VinVL

✨ Features

COCO Test set metrics 📈
- Table from the authors (Table 7, cross-entropy optimization):

| BLEU-4 | METEOR | CIDEr | SPICE |
|--------|--------|-------|-------|
| 0.38 | 0.30 | 1.29 | 0.23 |

Feature extraction ⛏️
- This model relies on a separate visual backbone to extract image features.
- More info about:
  - the model: michelecafagna26/vinvl_vg_x152c4
  - the usage and installation: michelecafagna26/vinvl-visualbackbone
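
The captioning model expects one feature vector per detected region, with a default size of 2054. As a rough illustration of where that number can come from, the sketch below assumes the common Oscar/VinVL layout of 2048 region features followed by 6 normalized box values. That 2048 + 6 split, the helper name `append_box_geometry`, and the toy inputs are assumptions made here, not something this model card states; the visual backbone repository remains the reference for the real extraction pipeline.

```python
import numpy as np


def append_box_geometry(region_feats, boxes, img_w, img_h):
    """Concatenate region features with 6 normalized box values.

    Hypothetical helper, assuming a 2048 + 6 feature layout.
    region_feats: (num_boxes, 2048) array of region features.
    boxes: (num_boxes, 4) array of [x1, y1, x2, y2] in pixels.
    Returns an array of shape (num_boxes, 2054).
    """
    region_feats = np.asarray(region_feats, dtype=np.float32)
    boxes = np.asarray(boxes, dtype=np.float32)
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    # Corner coordinates plus box width and height, scaled by image size.
    geometry = np.stack(
        [x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h,
         (x2 - x1) / img_w, (y2 - y1) / img_h],
        axis=1,
    )
    return np.concatenate([region_feats, geometry], axis=1)


# Toy inputs standing in for real detector output.
feats = np.zeros((3, 2048), dtype=np.float32)
boxes = np.array([[0, 0, 320, 240], [10, 20, 110, 220], [320, 240, 640, 480]])
feat_obj = append_box_geometry(feats, boxes, img_w=640, img_h=480)
print(feat_obj.shape)  # (3, 2054)
```

An array built this way has the `(num_boxes, feat_size)` shape that the usage example converts to a tensor and batches with `unsqueeze(0)`.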

💻 Usage Examples

Basic Usage

import torch
from transformers.pytorch_transformers import BertConfig, BertTokenizer
from oscar.modeling.modeling_bert import BertForImageCaptioning
from oscar.wrappers import OscarTensorizer

ckpt = "path/to/the/checkpoint"
device = "cuda" if torch.cuda.is_available() else "cpu"

# original code
config = BertConfig.from_pretrained(ckpt)
tokenizer = BertTokenizer.from_pretrained(ckpt)
model = BertForImageCaptioning.from_pretrained(ckpt, config=config).to(device)

# This takes care of the preprocessing
tensorizer = OscarTensorizer(tokenizer=tokenizer, device=device)

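# feat_obj is assumed to be a numpy array of shape (num_boxes, feat_size) from the visual backbone
# feat_size is 2054 by default in VinVL
# unsqueeze(0) adds the batch dimension, giving shape (1, num_boxes, feat_size)
visual_features = torch.from_numpy(feat_obj).to(device).unsqueeze(0)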

# labels are usually extracted by the features extractor
labels = [['boat', 'boat', 'boat', 'bottom', 'bush', 'coat', 'deck', 'deck', 'deck', 'dock', 'hair', 'jacket']]

inputs = tensorizer.encode(visual_features, labels=labels)
outputs = model(**inputs)

pred = tensorizer.decode(outputs)

# the output looks like this:
# pred = {0: [{'caption': 'a red and white boat traveling down a river next to a small boat.', 'conf': 0.7070220112800598}]}
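
The decoded output shown above maps each image index to a list of caption candidates, each with a `caption` string and a `conf` score. A minimal sketch for keeping only the highest-confidence caption per image could look like the following. The helper name `best_captions` is hypothetical and not part of the Oscar or VinVL code, and the second candidate in the sample data is invented purely for illustration.

```python
def best_captions(pred):
    """Map each image index to its highest-confidence caption.

    Hypothetical helper: assumes pred has the structure
    {index: [{'caption': str, 'conf': float}, ...]} shown in the usage example.
    """
    best = {}
    for idx, candidates in pred.items():
        if not candidates:
            continue  # skip images with no decoded caption
        top = max(candidates, key=lambda c: c["conf"])
        best[idx] = top["caption"]
    return best


# Sample data: the first candidate is the output from the example above,
# the second is a made-up lower-confidence alternative.
pred = {
    0: [
        {"caption": "a red and white boat traveling down a river next to a small boat.",
         "conf": 0.7070220112800598},
        {"caption": "a boat on the water.", "conf": 0.21},
    ]
}
print(best_captions(pred)[0])
```

If the decoder returns only one candidate per image, this reduces to unwrapping the single caption.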

📄 License

This project is licensed under the Apache-2.0 license.

🧾 Citations

Please consider citing the original project and the VinVL paper.

@misc{han2021image,
      title={Image Scene Graph Generation (SGG) Benchmark}, 
      author={Xiaotian Han and Jianwei Yang and Houdong Hu and Lei Zhang and Jianfeng Gao and Pengchuan Zhang},
      year={2021},
      eprint={2107.12604},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

@inproceedings{zhang2021vinvl,
  title={Vinvl: Revisiting visual representations in vision-language models},
  author={Zhang, Pengchuan and Li, Xiujun and Hu, Xiaowei and Yang, Jianwei and Zhang, Lei and Wang, Lijuan and Choi, Yejin and Gao, Jianfeng},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={5579--5588},
  year={2021}
}

📋 Additional Information

| Property | Details |
|----------|---------|
| Library Name | PyTorch |
| Tags | PyTorch, image-to-text |
| Datasets | COCO, conceptual-caption, SBU, Flickr30k, VQA, GQA, VG-QA, Open-Images |
