
Vit2distilgpt2

Developed by sachin
This is an image-to-text model that takes an image as input and generates a descriptive caption.
Release Time: 3/2/2022

Model Overview

The model is based on the ViT and DistilGPT2 architectures and is designed specifically for image captioning. It was trained on the COCO 2017 dataset.
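If the checkpoint is published on the Hugging Face Hub as a VisionEncoderDecoderModel, captioning an image could look roughly like the sketch below. The repo ID sachin/vit2distilgpt2 and the generation settings are assumptions inferred from the model name above, not details confirmed by this card.

```python
# A minimal inference sketch, assuming the checkpoint is available on the
# Hugging Face Hub as a VisionEncoderDecoderModel. The repo ID below is an
# assumption based on the model name and may not match the actual hub path.
from PIL import Image
from transformers import AutoTokenizer, ViTImageProcessor, VisionEncoderDecoderModel

model_id = "sachin/vit2distilgpt2"  # assumed repo ID

model = VisionEncoderDecoderModel.from_pretrained(model_id)
image_processor = ViTImageProcessor.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Preprocess the image into the pixel values expected by the ViT encoder.
image = Image.open("example.jpg").convert("RGB")
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values

# Generate a caption; the beam search settings here are illustrative,
# not the values used by the author.
output_ids = model.generate(pixel_values, max_length=32, num_beams=4)
caption = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(caption)
```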

Model Features

Vision-Language Joint Model
Pairs a visual encoder (ViT) with a language decoder (DistilGPT2) to convert images into text (a minimal assembly sketch follows this list)
Trained on COCO Dataset
Trained on COCO 2017, a widely used image captioning dataset, which gives the model good generalization
Lightweight Architecture
Uses DistilGPT2 as the decoder, making it more lightweight than the full GPT-2
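The encoder/decoder split described above can be assembled with the transformers library roughly as in the sketch below. The base checkpoints named here are common choices and are assumptions; the card does not state which exact variants were used.

```python
# Sketch of how a ViT encoder and a DistilGPT2 decoder can be joined into a
# single image-captioning model. The base checkpoints are assumed, not taken
# from this card.
from transformers import AutoTokenizer, VisionEncoderDecoderModel

model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k",  # assumed ViT image encoder
    "distilgpt2",                          # DistilGPT2 text decoder
)

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 tokenizers ship without a pad token

# The decoder needs these token IDs set before training or generation.
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.eos_token_id
```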

Model Capabilities

Image Understanding
Text Generation
Image Caption Generation

Use Cases

Assistive Technology
Visual Assistance
Generates image descriptions for visually impaired individuals
Content Generation
Social Media Content Auto-Generation
Automatically generates descriptive text for uploaded images