Image Caption Generator: an open-source vision-language model that generates natural language descriptions for images

Image Caption Generator

Developed by bipin

A vision-language model trained on the Flickr8k dataset, capable of generating natural language descriptions for input images

Image-to-Text

Transformers

#Image-to-Text #Vision-Language Model #Flickr8k Training

Downloads 177

Release Time: 3/27/2022

Model Overview

This model is an image-to-text conversion model that analyzes the content of input images and generates corresponding textual descriptions. Based on Transformer architecture, it combines a visual encoder and a text decoder.

Model Features

Transformer-Based Architecture

Combines a visual encoder (ViT) with a text decoder (GPT-2) for image-to-text conversion

End-to-End Training

The entire model is trained end-to-end, simplifying the image caption generation process

Beam Search Generation

Supports beam search generation strategy to improve the quality of generated descriptions
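
To illustrate why beam search can improve on greedy decoding, here is a minimal, self-contained sketch. It does not use this model: `TOY_MODEL` is a hypothetical next-token table invented for the example, and the `beam_search` helper is a simplified illustration (no length penalty, no batching), not the implementation used by the Transformers library.

```python
import math

# Hypothetical toy "language model": next-token probabilities keyed by the
# previous token. Invented for illustration only.
TOY_MODEL = {
    "<bos>": {"a": 0.6, "the": 0.4},
    "a": {"dog": 0.55, "cat": 0.45},
    "the": {"dog": 0.9, "cat": 0.1},
    "dog": {"<eos>": 1.0},
    "cat": {"<eos>": 1.0},
}


def beam_search(model, num_beams=2, max_length=4):
    """Return the highest-scoring sequence found, as a space-joined string."""
    beams = [(0.0, ["<bos>"])]  # (cumulative log-probability, tokens so far)
    finished = []
    for _ in range(max_length):
        # Expand every live beam by every possible next token.
        candidates = []
        for score, tokens in beams:
            for token, prob in model[tokens[-1]].items():
                candidates.append((score + math.log(prob), tokens + [token]))
        # Keep only the num_beams best candidates.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, tokens in candidates[:num_beams]:
            if tokens[-1] == "<eos>":
                finished.append((score, tokens))
            else:
                beams.append((score, tokens))
        if not beams:
            break
    best_score, best_tokens = max(finished or beams, key=lambda c: c[0])
    return " ".join(t for t in best_tokens if t not in ("<bos>", "<eos>"))


greedy = beam_search(TOY_MODEL, num_beams=1)  # always follows the single best token
wider = beam_search(TOY_MODEL, num_beams=2)   # keeps two hypotheses alive per step
print(greedy, "|", wider)
```

With one beam the search commits to "a" (probability 0.6) and ends with a sequence of probability 0.33, while two beams keep "the" alive long enough to find "the dog" at 0.36. Higher `num_beams` values trade extra computation for a wider search.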

Model Capabilities

Image Content Understanding

Natural Language Description Generation

Vision-Language Conversion

Use Cases

Assistive Technology

Visual Assistance

Generates text descriptions of image content that a screen reader or text-to-speech system can read aloud for visually impaired users

Content Management

Automatic Image Tagging

Automatically generates descriptive tags for large volumes of images to facilitate search and management
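
One simple way to turn a generated caption into search tags is to drop common function words and keep the remaining terms. The sketch below is an illustration under that assumption: `STOP_WORDS` and `caption_to_tags` are hypothetical names introduced here, not part of the model or of any library, and the sample caption is made up.

```python
# Hypothetical stop-word list; a real system would likely use a fuller one.
STOP_WORDS = {
    "a", "an", "the", "in", "on", "of", "at", "with",
    "and", "is", "are", "to", "for", "by",
}


def caption_to_tags(caption, max_tags=5):
    """Extract up to max_tags unique, lowercase content words from a caption."""
    tags = []
    for word in caption.lower().split():
        word = word.strip(".,!?;:\"'")  # remove surrounding punctuation
        if word and word not in STOP_WORDS and word not in tags:
            tags.append(word)
    return tags[:max_tags]


print(caption_to_tags("A dog runs on the grass with a ball."))
```

For the sample caption this yields `['dog', 'runs', 'grass', 'ball']`. Keeping the original word order makes the tags easy to trace back to the caption.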

🚀 Image-caption-generator

This model generates captions for images. It was trained on the Flickr8k dataset.

🚀 Quick Start

Load the pre-trained model from the model hub

from transformers import VisionEncoderDecoderModel, ViTFeatureExtractor, AutoTokenizer
import torch
from PIL import Image

model_name = "bipin/image-caption-generator"

# load model
model = VisionEncoderDecoderModel.from_pretrained(model_name)
feature_extractor = ViTFeatureExtractor.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained("gpt2")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

Load the image to be captioned (note: replace the value of `img_name` with the path to an image of your choice)

# replace the value with the path to your image
img_name = "flickr_data.jpg"
img = Image.open(img_name)
if img.mode != 'RGB':
    img = img.convert(mode="RGB")

Preprocess the image

pixel_values = feature_extractor(images=[img], return_tensors="pt").pixel_values
pixel_values = pixel_values.to(device)

Generate the caption

max_length = 128
num_beams = 4

# get model prediction
output_ids = model.generate(pixel_values, num_beams=num_beams, max_length=max_length)

# decode the generated prediction
preds = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(preds)
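
The steps above can be wrapped into a helper that captions several images in one `generate` call. This is a sketch rather than part of the original model card: `caption_batch` is a hypothetical name, and it assumes the `model`, `feature_extractor`, `tokenizer`, and `device` objects created in the Quick Start, plus a list of PIL images already converted to RGB.

```python
def caption_batch(images, model, feature_extractor, tokenizer, device,
                  num_beams=4, max_length=128):
    """Caption a list of RGB PIL images with a single generate() call."""
    if not images:
        return []
    # Preprocess all images together into one batch of pixel values.
    pixel_values = feature_extractor(images=images, return_tensors="pt").pixel_values
    pixel_values = pixel_values.to(device)
    # Beam search runs independently for each image in the batch.
    output_ids = model.generate(pixel_values, num_beams=num_beams, max_length=max_length)
    # batch_decode turns each row of token ids into one caption string.
    captions = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
    return [caption.strip() for caption in captions]


# Usage, reusing the Quick Start objects (file names are placeholders):
# images = [Image.open(p).convert("RGB") for p in ["first.jpg", "second.jpg"]]
# print(caption_batch(images, model, feature_extractor, tokenizer, device))
```

Batching mainly helps on a GPU; on a CPU, captioning images one at a time may be just as fast.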

📚 Documentation

This model achieves the following results on the evaluation set:

eval_loss: 0.2536
eval_runtime: 25.369
eval_samples_per_second: 63.818
eval_steps_per_second: 8.002
epoch: 4.0
step: 3236
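
The throughput figures above can be cross-checked with a little arithmetic. The sketch below derives the approximate evaluation set size and step counts from the reported numbers. The evaluation batch size of 8 comes from the hyperparameters listed under Technical Details, and the perplexity line assumes `eval_loss` is a mean per-token cross-entropy in nats, which the source does not state.

```python
import math

# Values reported on the evaluation set.
eval_loss = 0.2536
eval_runtime = 25.369             # seconds
eval_samples_per_second = 63.818
eval_steps_per_second = 8.002
epoch = 4.0
step = 3236
eval_batch_size = 8               # from the training hyperparameters

# runtime x throughput recovers the number of samples and batches evaluated.
eval_samples = round(eval_runtime * eval_samples_per_second)
eval_steps = round(eval_runtime * eval_steps_per_second)

# The step count should match ceil(samples / batch size).
expected_steps = math.ceil(eval_samples / eval_batch_size)

# Training steps per epoch implied by the reported step and epoch.
steps_per_epoch = step / epoch

# Assumption: eval_loss is mean token-level cross-entropy in nats.
perplexity = math.exp(eval_loss)

print(eval_samples, eval_steps, expected_steps, steps_per_epoch, round(perplexity, 2))
```

The numbers are mutually consistent: about 1619 evaluation samples in 203 batches of up to 8, and 809 training steps per epoch.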

🔧 Technical Details

Training procedure

The procedure used to train this model can be found here.

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-05
train_batch_size: 8
eval_batch_size: 8
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
num_epochs: 5
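
To make `lr_scheduler_type: linear` concrete, the following sketch computes the learning rate at a given step under a linear decay to zero. Two values are assumptions not stated in the source: `warmup_steps=0`, and `total_steps=4045`, estimated as 809 steps per epoch (step 3236 at epoch 4.0) times 5 epochs. `linear_lr` is a hypothetical helper, not a library function.

```python
def linear_lr(step, base_lr=5e-05, total_steps=4045, warmup_steps=0):
    """Learning rate under linear warmup followed by linear decay to zero.

    total_steps and warmup_steps are assumed values, see the text above.
    """
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr during warmup.
        return base_lr * step / max(1, warmup_steps)
    # Linear decay from base_lr at the end of warmup to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for s in (0, 809, 3236, 4045):
    print(s, linear_lr(s))
```

Under these assumptions the rate starts at 5e-05, falls by one fifth of that per epoch, and is about 1e-05 at step 3236, the checkpoint reported in the evaluation results.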

Framework versions

Transformers 4.16.2
PyTorch 1.9.1
Datasets 1.18.4
Tokenizers 0.11.6
