
Vit2distilgpt2

Developed by sachin
This is an image-to-text model that takes an image as input and generates a descriptive caption.
Release Time: 3/2/2022

Model Overview

The model is based on the ViT and DistilGPT2 architectures and is designed specifically for image captioning. It was trained on the COCO 2017 dataset.
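If the checkpoint is published on the Hugging Face Hub as a VisionEncoderDecoderModel, captioning an image could look roughly like the sketch below. The repo ID sachin/vit2distilgpt2 and the generation settings are assumptions inferred from the model name above, not details confirmed by this card.

```python
# A minimal inference sketch, assuming the checkpoint is available on the
# Hugging Face Hub as a VisionEncoderDecoderModel. The repo ID below is an
# assumption based on the model name and may not match the actual hub path.
from PIL import Image
from transformers import AutoTokenizer, ViTImageProcessor, VisionEncoderDecoderModel

model_id = "sachin/vit2distilgpt2"  # assumed repo ID

model = VisionEncoderDecoderModel.from_pretrained(model_id)
image_processor = ViTImageProcessor.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Preprocess the image into the pixel values expected by the ViT encoder.
image = Image.open("example.jpg").convert("RGB")
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values

# Generate a caption; the beam search settings here are illustrative,
# not the values used by the author.
output_ids = model.generate(pixel_values, max_length=32, num_beams=4)
caption = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(caption)
```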

Model Features

Vision-Language Joint Model
Pairs a visual encoder (ViT) with a language decoder (DistilGPT2) to convert images into text (a minimal assembly sketch follows this list)
Trained on COCO Dataset
Trained on COCO 2017, a widely used image captioning dataset, which gives the model good generalization
Lightweight Architecture
Uses DistilGPT2 as the decoder, making it more lightweight than the full GPT-2
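The encoder/decoder split described above can be assembled with the transformers library roughly as in the sketch below. The base checkpoints named here are common choices and are assumptions; the card does not state which exact variants were used.

```python
# Sketch of how a ViT encoder and a DistilGPT2 decoder can be joined into a
# single image-captioning model. The base checkpoints are assumed, not taken
# from this card.
from transformers import AutoTokenizer, VisionEncoderDecoderModel

model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k",  # assumed ViT image encoder
    "distilgpt2",                          # DistilGPT2 text decoder
)

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 tokenizers ship without a pad token

# The decoder needs these token IDs set before training or generation.
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.eos_token_id
```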

Model Capabilities

Image Understanding
Text Generation
Image Caption Generation

Use Cases

Assistive Technology
Visual Assistance
Generates image descriptions for visually impaired individuals
Content Generation
Social Media Content Auto-Generation
Automatically generates descriptive text for uploaded images