🚀 vit-swin-base-224-gpt2-image-captioning
This model is a fine-tuned VisionEncoderDecoder model, trained on 60% of the COCO2014 dataset, that can be used for image captioning.
✨ Features
- Trained on the large-scale COCO2014 image dataset.
- Achieves good scores on test-set metrics such as ROUGE and BLEU (see the training results below).
- Can be used either through the simple `pipeline` API or, for more flexibility, by initializing the components yourself.
📦 Installation
No specific installation steps are given in the original README. The usage examples below assume `transformers`, `torch`, `Pillow`, and `requests` are installed (e.g. via `pip install transformers torch Pillow requests`).
💻 Usage Examples
Basic Usage
You can use the simple `pipeline` API:

```python
from transformers import pipeline

image_captioner = pipeline("image-to-text", model="Abdou/vit-swin-base-224-gpt2-image-captioning")
caption = image_captioner("http://images.cocodataset.org/test-stuff2017/000000000019.jpg")[0]["generated_text"]
print(f"caption: {caption}")
```
Advanced Usage
Or initialize everything for more flexibility:
```python
from transformers import VisionEncoderDecoderModel, GPT2TokenizerFast, ViTImageProcessor
import torch
import os
import urllib.parse as parse
from PIL import Image
import requests

def is_url(string):
    """Return True if `string` parses as a complete URL."""
    try:
        result = parse.urlparse(string)
        return all([result.scheme, result.netloc, result.path])
    except Exception:
        return False

def load_image(image_path):
    """Load an image from a URL or a local file path."""
    if is_url(image_path):
        return Image.open(requests.get(image_path, stream=True).raw)
    elif os.path.exists(image_path):
        return Image.open(image_path)
    raise ValueError(f"Cannot load image from: {image_path}")

def get_caption(model, image_processor, tokenizer, image_path):
    image = load_image(image_path)
    # Preprocess the image and move the pixel values to the model's device
    inputs = image_processor(image, return_tensors="pt").to(device)
    # Generate token IDs, then decode them into a caption string
    output = model.generate(**inputs)
    caption = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
    return caption

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the fine-tuned model and its preprocessing components
model = VisionEncoderDecoderModel.from_pretrained("Abdou/vit-swin-base-224-gpt2-image-captioning").to(device)
tokenizer = GPT2TokenizerFast.from_pretrained("Abdou/vit-swin-base-224-gpt2-image-captioning")
image_processor = ViTImageProcessor.from_pretrained("Abdou/vit-swin-base-224-gpt2-image-captioning")

url = "http://images.cocodataset.org/test-stuff2017/000000000019.jpg"
caption = get_caption(model, image_processor, tokenizer, url)
print(f"caption: {caption}")
```
Output:

```
Two cows laying in a field with a sky background.
```
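With the components loaded above, several images can be captioned in one `generate` call by batching them through the image processor. A minimal sketch, reusing `model`, `image_processor`, `tokenizer`, and `load_image` from the example above (the second URL is an illustrative placeholder):

```python
# Minimal batched-captioning sketch; the second URL is a placeholder.
urls = [
    "http://images.cocodataset.org/test-stuff2017/000000000019.jpg",
    "http://images.cocodataset.org/test-stuff2017/000000000128.jpg",  # hypothetical example image
]
images = [load_image(u) for u in urls]
pixel_values = image_processor(images, return_tensors="pt").pixel_values.to(device)
outputs = model.generate(pixel_values=pixel_values)
for u, caption in zip(urls, tokenizer.batch_decode(outputs, skip_special_tokens=True)):
    print(f"{u} -> {caption}")
```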
📚 Documentation
Model description
The model was initialized from microsoft/swin-base-patch4-window7-224-in22k as the vision encoder and gpt2 as the text decoder.
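For reference, such an encoder-decoder pairing can be assembled with the standard `VisionEncoderDecoderModel.from_encoder_decoder_pretrained` helper. This is a minimal initialization sketch, not the full fine-tuning recipe; the pad/start-token choices below are common conventions, assumed here rather than taken from the model card:

```python
from transformers import VisionEncoderDecoderModel, GPT2TokenizerFast

# Pair a Swin vision encoder with a GPT-2 text decoder; the decoder's
# cross-attention weights are freshly initialized and require fine-tuning.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "microsoft/swin-base-patch4-window7-224-in22k", "gpt2"
)

# GPT-2 has no pad token; reusing EOS is a common (assumed) convention.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id
model.config.decoder_start_token_id = tokenizer.bos_token_id
```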
Intended uses & limitations
You can use this model for image captioning only.
Training procedure
You can check this guide to learn how this model was fine-tuned.
Training hyperparameters
The following hyperparameters were used during training (a sketch mapping them onto `Seq2SeqTrainingArguments` follows the list):
- learning_rate: 5e-05
- train_batch_size: 64
- eval_batch_size: 64
- seed: 42
- optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 2
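Expressed with the `transformers` trainer API, these settings correspond roughly to the sketch below. Only the hyperparameters listed above come from the model card; `output_dir` and `predict_with_generate` are assumptions:

```python
from transformers import Seq2SeqTrainingArguments

# Only the values mirrored from the list above come from the model card.
training_args = Seq2SeqTrainingArguments(
    output_dir="image-captioning",      # assumed name
    learning_rate=5e-05,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    seed=42,
    num_train_epochs=2,
    lr_scheduler_type="linear",
    predict_with_generate=True,         # assumed: needed to compute ROUGE/BLEU during eval
)
# The listed Adam betas/epsilon match the Trainer's default optimizer settings.
```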
Training results
| Training Loss | Epoch | Step | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Bleu | Gen Len |
|---------------|-------|-------|-----------------|---------|---------|---------|-----------|---------|---------|
| 1.0018 | 0.38 | 2000 | 0.8860 | 38.6537 | 13.8145 | 35.3932 | 35.3935 | 8.2448 | 11.2946 |
| 0.8827 | 0.75 | 4000 | 0.8395 | 40.0458 | 14.8829 | 36.5321 | 36.5366 | 9.1169 | 11.2946 |
| 0.8378 | 1.13 | 6000 | 0.8140 | 41.2736 | 15.9576 | 37.5504 | 37.5512 | 9.871 | 11.2946 |
| 0.7913 | 1.51 | 8000 | 0.8012 | 41.6642 | 16.1987 | 37.8786 | 37.8891 | 10.0786 | 11.2946 |
| 0.7794 | 1.89 | 10000 | 0.7933 | 41.9119 | 16.3738 | 38.1062 | 38.1292 | 10.288 | 11.2946 |
Total training time: ~5 hours on an NVIDIA A100 GPU.
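ROUGE and BLEU figures like those in the table can be computed for your own predictions with the 🤗 `evaluate` library. A generic sketch, not the exact evaluation script used for this model (the prediction/reference strings are placeholders):

```python
import evaluate

# Generic metric computation; the strings below are placeholders.
rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")

predictions = ["two cows laying in a field with a sky background"]
references = ["two cows lying down in a grassy field under a blue sky"]

print(rouge.compute(predictions=predictions, references=references))
print(bleu.compute(predictions=predictions, references=[[r] for r in references]))
```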
Framework versions
- Transformers 4.26.0
- Pytorch 1.13.1+cu116
- Datasets 2.9.0
- Tokenizers 0.13.2
📄 License
This project is licensed under the MIT license.