Image Captioning Model
The Illustrated Image Captioning using transformers model
This project uses a transformer-based model to generate descriptions for images, leveraging Vision Transformers to improve image understanding and caption generation.
Quick Start
This repository focuses on the challenging task of image captioning: generating human-like descriptions for images. By leveraging Vision Transformers, the project aims to enhance image understanding and caption generation. Transformers have shown strong results across natural language processing tasks, and this project explores their application, combined with computer vision, to image captioning.
Features
- Advanced Model Utilization: Employs Vision Transformers (ViTs), attention mechanisms, language modeling, transfer learning, and multiple evaluation metrics for image captioning.
- Diverse Library Support: Implemented in Python with libraries such as PyTorch, Transformers, TorchVision, NumPy, NLTK, and Matplotlib.
- Comprehensive Training and Inference: Trained using cross-entropy loss with doubly stochastic attention regularization, and uses beam search for inference.
Installation
Install COCO API
- Clone this repo:
git clone https://github.com/cocodataset/cocoapi.git
- Set up the COCO API (also described in the COCO API README)
cd cocoapi/PythonAPI
make
cd ..
- Download the following data from http://cocodataset.org/#download (described below)
- Under Annotations, download:
  - 2017 Train/Val annotations [241MB] (extract captions_train2017.json and captions_val2017.json, and place them at cocoapi/annotations/captions_train2017.json and cocoapi/annotations/captions_val2017.json, respectively)
  - 2017 Testing Image info [1MB] (extract image_info_test2017.json and place it at cocoapi/annotations/image_info_test2017.json)
- Under Images, download:
  - 2017 Train images [83K/13GB] (extract the train2017 folder and place it at cocoapi/images/train2017/)
  - 2017 Val images [41K/6GB] (extract the val2017 folder and place it at cocoapi/images/val2017/)
  - 2017 Test images [41K/6GB] (extract the test2017 folder and place it at cocoapi/images/test2017/)
Preparing the environment
Note: This project was developed on macOS. It should also run on Windows and Linux with minor changes.
- Clone the repository, and navigate to the downloaded folder.
git clone https://github.com/CapstoneProjectimagecaptioning/image_captioning_transformer.git
cd image_captioning_transformer
- Create (and activate) a new environment named captioning_env with Python 3.7. If prompted to proceed with the install (Proceed [y]/n), type y.
conda create -n captioning_env python=3.7
source activate captioning_env
At this point your command line should look something like: (captioning_env) <User>:image_captioning <user>$. The (captioning_env) prefix indicates that your environment has been activated, and you can proceed with further package installations.
- Before you can experiment with the code, make sure you have all the libraries and dependencies required for this project. You will mainly need Python 3.7+, PyTorch and its torchvision, OpenCV, and Matplotlib. You can install the dependencies using:
pip install -r requirements.txt
- Navigate back to the repo. (Also, your source environment should still be activated at this point.)
cd image_captioning
- Open the directory of notebooks, using the below command. You'll see all of the project files appear in your local environment; open the first notebook and follow the instructions.
jupyter notebook
- Once you open any of the project notebooks, make sure you are in the correct captioning_env environment by clicking Kernel > Change Kernel > captioning_env.
Usage Examples
Basic Usage
The overall process of using this image captioning model involves data loading, pre-processing, training, and inference. Here is a high-level overview of the steps:
# The following is a simplified representation of the actual code steps

# 1. Data Loading and Preprocessing
import json

import pandas as pd
import tensorflow as tf
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

# Load Annotations and build a lookup from image id to file name
with open('cocoapi/annotations/captions_train2017.json', 'r') as f:
    data = json.load(f)

id_to_filename = {img['id']: img['file_name'] for img in data['images']}

# Pairing Images and Captions
img_cap_pairs = []
for ann in data['annotations']:
    img_name = id_to_filename.get(ann['image_id'])
    if img_name:
        img_cap_pairs.append((img_name, ann['caption']))

df = pd.DataFrame(img_cap_pairs, columns=['image', 'caption'])

# Sampling Data: keep a manageable subset of image-caption pairs
df = df.sample(n=70000)
# 2. Text Preprocessing
import re

def preprocess_caption(caption):
    # Lowercase, strip punctuation, collapse whitespace, and add sentence markers
    caption = caption.lower()
    caption = re.sub(r'[^\w\s]', '', caption)
    caption = re.sub(r'\s+', ' ', caption).strip()
    caption = '[start] ' + caption + ' [end]'
    return caption

df['caption'] = df['caption'].apply(preprocess_caption)

# 3. Tokenization
vocab_size = 15000
max_length = 40

# standardize=None keeps the [start]/[end] markers intact (captions are already cleaned)
tokenizer = TextVectorization(max_tokens=vocab_size,
                              output_sequence_length=max_length,
                              standardize=None)
tokenizer.adapt(df['caption'].tolist())

vocab = tokenizer.get_vocabulary()
word2idx = {word: idx for idx, word in enumerate(vocab)}
idx2word = {idx: word for idx, word in enumerate(vocab)}
# 4. Dataset Preparation
from sklearn.model_selection import train_test_split

# Group captions by image so the train/validation split is done per image
image_to_captions = {}
for _, row in df.iterrows():
    image_to_captions.setdefault(row['image'], []).append(row['caption'])

images = list(image_to_captions.keys())
train_images, val_images = train_test_split(images, test_size=0.2)

def make_pairs(image_list):
    # Flatten back to one (image path, caption) pair per caption so the tensors stay rectangular;
    # images from the 2017 train annotations live under cocoapi/images/train2017/
    pairs = [('cocoapi/images/train2017/' + img, cap)
             for img in image_list
             for cap in image_to_captions[img]]
    paths, captions = zip(*pairs)
    return list(paths), list(captions)

def load_data(image_path, caption):
    image = tf.io.read_file(image_path)
    image = tf.image.decode_jpeg(image, channels=3)
    image = tf.image.resize(image, (224, 224))
    image = tf.keras.applications.resnet.preprocess_input(image)
    tokens = tokenizer(tf.expand_dims(caption, 0))[0]
    return image, tokens

train_paths, train_captions = make_pairs(train_images)
val_paths, val_captions = make_pairs(val_images)

train_dataset = tf.data.Dataset.from_tensor_slices((train_paths, train_captions)).map(load_data).batch(32)
val_dataset = tf.data.Dataset.from_tensor_slices((val_paths, val_captions)).map(load_data).batch(32)
Advanced Usage
# Training the model
import tensorflow as tf
from tensorflow.keras import layers

# Define the model architecture (simplified here)
class ImageCaptioningModel(tf.keras.Model):
    def __init__(self):
        super().__init__()
        # Frozen ResNet101 backbone; global average pooling yields one feature vector per image
        self.image_encoder = tf.keras.applications.ResNet101(include_top=False,
                                                             weights='imagenet',
                                                             pooling='avg')
        self.image_encoder.trainable = False
        self.image_embedding = layers.Dense(256)
        self.word_embedding = layers.Embedding(vocab_size, 256)
        self.decoder = layers.LSTM(256, return_sequences=True)
        self.fc = layers.Dense(vocab_size)

    def call(self, image, caption):
        # (batch, 2048) image features projected to the decoder dimension
        image_emb = self.image_embedding(self.image_encoder(image))
        # Condition the LSTM on the image by using the image embedding as its initial state
        caption_emb = self.word_embedding(caption)
        decoder_output = self.decoder(caption_emb, initial_state=[image_emb, image_emb])
        # (batch, T, vocab_size) logits, one distribution per position
        return self.fc(decoder_output)

model = ImageCaptioningModel()

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')

def loss_function(real, pred):
    # Mask out padded positions (token id 0) before averaging the loss
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)
    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask
    return tf.reduce_sum(loss_) / tf.reduce_sum(mask)

optimizer = tf.keras.optimizers.Adam()

@tf.function
def train_step(image, caption):
    with tf.GradientTape() as tape:
        # Teacher forcing: predict token t+1 from tokens up to t
        predictions = model(image, caption)
        loss = loss_function(caption[:, 1:], predictions[:, :-1])
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return loss

EPOCHS = 100
for epoch in range(EPOCHS):
    total_loss = 0
    for batch, (image, caption) in enumerate(train_dataset):
        batch_loss = train_step(image, caption)
        total_loss += batch_loss
    print(f'Epoch {epoch + 1}: Loss {total_loss / len(train_dataset)}')
# Inference
def generate_caption(image_path):
    # Load and preprocess the image exactly as during training
    image = tf.io.read_file(image_path)
    image = tf.image.decode_jpeg(image, channels=3)
    image = tf.image.resize(image, (224, 224))
    image = tf.keras.applications.resnet.preprocess_input(image)
    image = tf.expand_dims(image, axis=0)

    # Start with the [start] token and generate greedily until [end] or max_length
    caption = tf.constant([[word2idx['[start]']]], dtype=tf.int64)
    result = []
    for _ in range(max_length):
        predictions = model(image, caption)
        predictions = predictions[:, -1, :]  # distribution for the next token
        predicted_id = int(tf.argmax(predictions, axis=1).numpy()[0])
        if idx2word[predicted_id] == '[end]':
            break
        result.append(idx2word[predicted_id])
        caption = tf.concat([caption, tf.constant([[predicted_id]], dtype=tf.int64)], axis=1)
    return ' '.join(result)
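After training, the helper above can be tried on any image on disk; the path below is only a placeholder, not a file shipped with the repository.

# Placeholder path; point this at any JPEG placed under cocoapi/images/
print(generate_caption('cocoapi/images/val2017/example.jpg'))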
Documentation
Dataset Used
About MS COCO dataset
The Microsoft Common Objects in COntext (MS COCO) dataset is a large-scale dataset for scene understanding. It is commonly used to train and benchmark object detection, segmentation, and captioning algorithms.

You can read more about the dataset on the website, research paper, or Appendix section at the end of this page.
Models and Technologies Used
The following methods and techniques are employed in this project:
- Vision Transformers (ViTs)
- Attention mechanisms
- Language modeling
- Transfer learning
- Evaluation metrics for image captioning (e.g., BLEU, METEOR, CIDEr); a small BLEU example follows the library list below
The project is implemented in Python and utilizes the following libraries:
- PyTorch
- Transformers
- TorchVision
- NumPy
- NLTK
- Matplotlib
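As a small illustration of how caption quality can be scored with the metrics listed above, the sketch below computes sentence-level BLEU with NLTK; the reference and candidate captions are made-up examples, not outputs of this project.

# Illustrative example: sentence-level BLEU with NLTK on made-up captions
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    'a man riding a wave on top of a surfboard'.split(),
    'a surfer rides a large wave in the ocean'.split(),
]
candidate = 'a man riding a wave on a surfboard'.split()

# Smoothing helps short captions that miss some higher-order n-grams
bleu4 = sentence_bleu(references, candidate,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method1)
print(f'BLEU-4: {bleu4:.3f}')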
Introduction
This project uses a transformer-based [3] model to generate a description for images, a task known as image captioning. Researchers have approached this problem with many methodologies. One of them is the encoder-decoder neural network [4]: the encoder transforms the source image into a representation space, and the decoder then translates the information from the encoded space into natural language. The goal of the encoder-decoder is to minimize the loss of generating a description from an image.
As shown in the survey by MD Zakir Hossain et al. [4], models that use the encoder-decoder architecture mainly consist of a language model based on LSTM [5], which decodes the encoded image received from a CNN; see Figure 1. The limitations of LSTMs on long sequences, together with the success of transformers in machine translation and other NLP tasks, drew attention to using them in machine vision. Alexey Dosovitskiy et al. introduced an image classification model (ViT) based on a classical transformer encoder that shows good performance [6]. Based on ViT, Wei Liu et al. presented an image captioning model (CPTR) using an encoder-decoder transformer [1]. The source image is fed to the transformer encoder as a sequence of patches; hence, one can treat the image captioning problem as a machine translation task.

Figure 1: Encoder Decoder Architecture
Framework
The CPTR [1] consists of an image patcher that converts images $x \in \mathbb{R}^{H \times W \times C}$ to a sequence of patches $x_p \in \mathbb{R}^{N \times E}$, where $N = \frac{HW}{P^2}$ is the number of patches; H, W, and C are the image height, width, and number of channels (C = 3), respectively; P is the patch resolution; and E is the image embedding size. Position embeddings are then added to the image patches, which form the input to twelve layers of identical transformer encoders. The output of the last encoder layer goes to four layers of identical transformer decoders. The decoder also takes the words with sinusoid positional embedding.
The pre-trained ViT weights initialize the CPTR encoder [1]. I omitted this initialization and the image positional embeddings, and added an image embedding module to the image patcher that uses the feature map extracted from the ResNet101 network [7]. The number of encoder layers is reduced to two. For ResNet101, I deleted the last two layers and the final softmax layer used for image classification.
Another modification takes place on the encoder side. The feedforward network consists of two convolution layers with a ReLU activation function in between. The encoder deals solely with the image, where it is beneficial to exploit the relative positions of the features we have. Refer to Figure 2 for the model architecture.

Figure 2: Model Architecture
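To make the image-to-sequence step concrete, here is a minimal sketch (written in the same TensorFlow style as the usage examples above, not the project's PyTorch code) of how a ResNet101 feature map can be flattened into a sequence of embeddings for a transformer encoder; the embedding size of 512 and the input resolution are assumptions for illustration only.

import tensorflow as tf

E = 512  # assumed image embedding size

# ResNet101 without its classification head; the final feature map plays the role of the patches
backbone = tf.keras.applications.ResNet101(include_top=False, weights='imagenet')
project = tf.keras.layers.Dense(E)

def image_to_sequence(images):
    # images: (batch, 224, 224, 3) -> sequence of embeddings (batch, 49, E)
    feats = backbone(images)                            # (batch, 7, 7, 2048)
    batch = tf.shape(feats)[0]
    seq = tf.reshape(feats, (batch, -1, feats.shape[-1]))  # flatten the 7x7 spatial grid
    return project(seq)

dummy = tf.random.uniform((2, 224, 224, 3))
print(image_to_sequence(dummy).shape)                   # (2, 49, 512)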
Training
The transformer decoder output goes to one fully connected layer, which provides, given the previous tokens, a probability distribution $p\left(y_c \mid y_{1:c-1}\right) \in \mathbb{R}^k$ (where k is the vocabulary size) for each token in the sequence.
I trained the model using cross-entropy loss given the target ground truth $y_{1:T}$, where T is the length of the sequence. I also add the doubly stochastic attention regularization [8] to the cross-entropy loss to penalize high weights in the encoder-decoder attention. This term encourages the sum of attention weights across the sequence to be approximately equal to one. By doing so, the model does not concentrate on specific parts of the image when generating a caption; instead, it looks all over the image, leading to richer and more descriptive text [8].
The loss function is defined as:
$$\mathcal{L} = -\sum_{c=1}^{T}\log\left(p\left(y_c \mid y_{1:c-1}\right)\right) + \sum_{l=1}^{L}\frac{1}{L}\left(\sum_{d=1}^{D}\sum_{i=1}^{P^2}\left(1-\sum_{c=1}^{T}\alpha_{cidl}\right)^2\right)$$
where D is the number of heads and L is the number of layers.
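For reference, here is a minimal sketch of this loss in the same TensorFlow style as the usage examples; the tensor layout and function name are assumptions, not the project's actual implementation.

import tensorflow as tf

def captioning_loss(targets, logits, attn):
    # targets: (batch, T) token ids, logits: (batch, T, vocab), and
    # attn: (batch, L, D, T, P^2) encoder-decoder attention weights (assumed layout)
    ce = tf.keras.losses.sparse_categorical_crossentropy(targets, logits, from_logits=True)
    mask = tf.cast(tf.not_equal(targets, 0), ce.dtype)   # ignore padded positions
    ce = tf.reduce_sum(ce * mask) / tf.reduce_sum(mask)

    # Doubly stochastic attention term: for every layer l, head d, and image position i,
    # push the attention summed over the caption tokens towards one
    attn_over_tokens = tf.reduce_sum(attn, axis=3)        # (batch, L, D, P^2)
    reg = tf.reduce_mean(tf.reduce_sum((1.0 - attn_over_tokens) ** 2, axis=[2, 3]), axis=1)
    return ce + tf.reduce_mean(reg)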
I used the Adam optimizer with a batch size of thirty-two. The model sizes can be found in the configuration file code/config.json. The evaluation metrics used are BLEU [9], METEOR [10], and GLEU [11].
I trained the model for one hundred epochs, stopping early if the tracked evaluation metric (BLEU-4) does not improve for twenty successive epochs. The learning rate is reduced by a factor of 0.25 if BLEU-4 does not improve for ten consecutive epochs. The model is evaluated against the validation split every two epochs.
The pre-trained GloVe embeddings [12] initialize the word embedding weights. The word embeddings are frozen for the first ten epochs. The ResNet101 network is fine-tuned from the beginning.
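The schedule above can be summarized in a short sketch; train_one_epoch and evaluate_bleu4 are assumed helper functions, and the reduction factor mirrors the description above.

best_bleu4, epochs_since_improvement = 0.0, 0
for epoch in range(100):
    train_one_epoch()                          # assumed helper: one pass over the training split
    if (epoch + 1) % 2 == 0:                   # validation every two epochs
        bleu4 = evaluate_bleu4()               # assumed helper: BLEU-4 on the validation split
        if bleu4 > best_bleu4:
            best_bleu4, epochs_since_improvement = bleu4, 0
        else:
            epochs_since_improvement += 2
        if epochs_since_improvement >= 20:     # stopping criterion
            break
        if epochs_since_improvement >= 10:     # reduce the learning rate (factor assumed)
            optimizer.learning_rate.assign(optimizer.learning_rate * 0.25)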
Inference
A beam search of size five is used to generate captions for the images in the test split. Generation starts by feeding the model the image and the "start of sentence" special token. At each iteration, the five tokens with the highest scores are kept. Generation stops when the "end of sentence" token is produced or the maximum length limit is reached.
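A minimal beam-search sketch is shown below; it assumes a next_token_log_probs(tokens) helper that returns log-probabilities over the vocabulary for the next token, and it is illustrative rather than the project's actual implementation.

import numpy as np

def beam_search(next_token_log_probs, start_id, end_id, beam_size=5, max_len=40):
    beams = [([start_id], 0.0)]                       # (token sequence, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            log_probs = next_token_log_probs(tokens)  # shape: (vocab_size,)
            for tok in np.argsort(log_probs)[-beam_size:]:
                candidates.append((tokens + [int(tok)], score + float(log_probs[tok])))
        # Keep only the beam_size highest-scoring candidates
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            (finished if tokens[-1] == end_id else beams).append((tokens, score))
        if not beams:                                 # every beam has produced the end token
            break
    best = max(finished + beams, key=lambda c: c[1])
    return best[0]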
Technical Details
Steps for Code Explanation
1. Data Loading and Preprocessing
- Load Annotations: The code first loads image-caption pairs from the COCO 2017 dataset, using the JSON file that contains the images and their corresponding captions (captions_train2017.json).
- Pairing Images and Captions: The code then creates a list (img_cap_pairs) that pairs image filenames with their respective captions.
- Dataframe for Captions: It organizes the data in a pandas DataFrame for easier manipulation, including creating a path to each image file.
- Sampling Data: 70,000 image-caption pairs are randomly sampled, making the dataset manageable without needing all of the data.
2. Text Preprocessing
The code preprocesses captions to prepare them for the model. It lowercases the text, removes punctuation, replaces multiple spaces with single spaces, and adds [start] and [end] tokens marking the beginning and end of each caption.
3. Tokenization
- Vocabulary Setup: A tokenizer (TextVectorization) is created with a vocabulary size of 15,000 words and a maximum sequence length of 40 tokens. It tokenizes captions, transforming them into sequences of integers.
- Saving Vocabulary: The vocabulary is saved to a file so that it can be reused later without re-adapting the tokenizer (a sketch follows this list).
- Mapping Words to Indexes: word2idx and idx2word are mappings that convert words to indices and vice versa.
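The vocabulary-saving step can look like the sketch below; the file name is an assumption, not taken from the project.

import json

# Save the adapted vocabulary once
with open('vocabulary.json', 'w') as f:
    json.dump(tokenizer.get_vocabulary(), f)

# Later: rebuild an equivalent tokenizer without re-adapting it. Depending on the
# TensorFlow version, set_vocabulary may expect the list without the '' padding and
# '[UNK]' entries that get_vocabulary() prepends.
with open('vocabulary.json', 'r') as f:
    vocab = json.load(f)
restored_tokenizer = TextVectorization(max_tokens=vocab_size,
                                       output_sequence_length=max_length,
                                       standardize=None)
restored_tokenizer.set_vocabulary(vocab)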
4. Dataset Preparation
- Image-Caption Mapping: Using a dictionary, each image is mapped to its list of captions. The images are then shuffled, and a train-validation split is made (80% for training, 20% for validation).
- Creating TensorFlow Datasets: Using the load_data function, images are read, resized, and preprocessed, and captions are tokenized into tensors. These tensors are batched for training and validation.
License
This project is licensed under the Apache 2.0 license.