Image Captioning Model
The Illustrated Image Captioning using transformers model
This project uses a transformer-based model to generate descriptions for images, leveraging Vision Transformers to improve image understanding and caption generation.
Quick Start
This repository focuses on the challenging task of image captioning: generating human-like descriptions for images. By leveraging Vision Transformers, the project aims to enhance image understanding and caption generation. Transformers have shown strong results across natural language processing tasks, and this project explores their application, combined with computer vision, to image captioning.
Features
- Advanced Model Utilization: Employs Vision Transformers (ViTs), attention mechanisms, language modeling, transfer learning, and multiple evaluation metrics for image captioning.
- Diverse Library Support: Implemented in Python with libraries such as PyTorch, Transformers, TorchVision, NumPy, NLTK, and Matplotlib.
- Comprehensive Training and Inference: Trained using cross-entropy loss with doubly stochastic attention regularization, and uses beam search for inference.
Installation
Install COCO API
- Clone this repo:
git clone https://github.com/cocodataset/cocoapi.git
- Set up the COCO API (also described in the COCO API README)
cd cocoapi/PythonAPI
make
cd ..
- Download the following data from http://cocodataset.org/#download (described below)
- Under Annotations, download:
  - 2017 Train/Val annotations [241MB] (extract captions_train2017.json and captions_val2017.json, and place them at cocoapi/annotations/captions_train2017.json and cocoapi/annotations/captions_val2017.json, respectively)
  - 2017 Testing Image info [1MB] (extract image_info_test2017.json and place it at cocoapi/annotations/image_info_test2017.json)
- Under Images, download:
  - 2017 Train images [83K/13GB] (extract the train2017 folder and place it at cocoapi/images/train2017/)
  - 2017 Val images [41K/6GB] (extract the val2017 folder and place it at cocoapi/images/val2017/)
  - 2017 Test images [41K/6GB] (extract the test2017 folder and place it at cocoapi/images/test2017/)
Preparing the environment
Note: This project was developed on macOS. It should also run on Windows and Linux with minor changes.
- Clone the repository, and navigate to the downloaded folder.
git clone https://github.com/CapstoneProjectimagecaptioning/image_captioning_transformer.git
cd image_captioning_transformer
- Create (and activate) a new environment named captioning_env with Python 3.7. If prompted to proceed with the install (Proceed [y]/n), type y.
conda create -n captioning_env python=3.7
source activate captioning_env
At this point your command line should look something like: (captioning_env) <User>:image_captioning <user>$. The (captioning_env) prefix indicates that your environment has been activated, and you can proceed with further package installations.
- Before you can experiment with the code, make sure you have all the libraries and dependencies required for this project. You will mainly need Python 3.7+, PyTorch and its torchvision, OpenCV, and Matplotlib. You can install the dependencies using:
pip install -r requirements.txt
- Navigate back to the repo. (Also, your source environment should still be activated at this point.)
cd image_captioning
- Open the directory of notebooks, using the below command. You'll see all of the project files appear in your local environment; open the first notebook and follow the instructions.
jupyter notebook
- Once you open any of the project notebooks, make sure you are in the correct captioning_env environment by clicking Kernel > Change Kernel > captioning_env.
Usage Examples
Basic Usage
The overall process of using this image captioning model involves data loading, pre-processing, training, and inference. Here is a high-level overview of the steps:
# The following is a simplified representation of the actual code steps

# 1. Data Loading and Preprocessing
import json

import pandas as pd
import tensorflow as tf
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

# Load Annotations and build a lookup from image id to file name
with open('cocoapi/annotations/captions_train2017.json', 'r') as f:
    data = json.load(f)

id_to_filename = {img['id']: img['file_name'] for img in data['images']}

# Pairing Images and Captions
img_cap_pairs = []
for ann in data['annotations']:
    img_name = id_to_filename.get(ann['image_id'])
    if img_name:
        img_cap_pairs.append((img_name, ann['caption']))

df = pd.DataFrame(img_cap_pairs, columns=['image', 'caption'])

# Sampling Data: keep a manageable subset of image-caption pairs
df = df.sample(n=70000)
# 2. Text Preprocessing
import re

def preprocess_caption(caption):
    # Lowercase, strip punctuation, collapse whitespace, and add sentence markers
    caption = caption.lower()
    caption = re.sub(r'[^\w\s]', '', caption)
    caption = re.sub(r'\s+', ' ', caption).strip()
    caption = '[start] ' + caption + ' [end]'
    return caption

df['caption'] = df['caption'].apply(preprocess_caption)

# 3. Tokenization
vocab_size = 15000
max_length = 40

# standardize=None keeps the [start]/[end] markers intact (captions are already cleaned)
tokenizer = TextVectorization(max_tokens=vocab_size,
                              output_sequence_length=max_length,
                              standardize=None)
tokenizer.adapt(df['caption'].tolist())

vocab = tokenizer.get_vocabulary()
word2idx = {word: idx for idx, word in enumerate(vocab)}
idx2word = {idx: word for idx, word in enumerate(vocab)}
# 4. Dataset Preparation
from sklearn.model_selection import train_test_split

# Group captions by image so the train/validation split is done per image
image_to_captions = {}
for _, row in df.iterrows():
    image_to_captions.setdefault(row['image'], []).append(row['caption'])

images = list(image_to_captions.keys())
train_images, val_images = train_test_split(images, test_size=0.2)

def make_pairs(image_list):
    # Flatten back to one (image path, caption) pair per caption so the tensors stay rectangular;
    # images from the 2017 train annotations live under cocoapi/images/train2017/
    pairs = [('cocoapi/images/train2017/' + img, cap)
             for img in image_list
             for cap in image_to_captions[img]]
    paths, captions = zip(*pairs)
    return list(paths), list(captions)

def load_data(image_path, caption):
    image = tf.io.read_file(image_path)
    image = tf.image.decode_jpeg(image, channels=3)
    image = tf.image.resize(image, (224, 224))
    image = tf.keras.applications.resnet.preprocess_input(image)
    tokens = tokenizer(tf.expand_dims(caption, 0))[0]
    return image, tokens

train_paths, train_captions = make_pairs(train_images)
val_paths, val_captions = make_pairs(val_images)

train_dataset = tf.data.Dataset.from_tensor_slices((train_paths, train_captions)).map(load_data).batch(32)
val_dataset = tf.data.Dataset.from_tensor_slices((val_paths, val_captions)).map(load_data).batch(32)
Advanced Usage
# Training the model
import tensorflow as tf
from tensorflow.keras import layers

# Define the model architecture (simplified here)
class ImageCaptioningModel(tf.keras.Model):
    def __init__(self):
        super().__init__()
        # Frozen ResNet101 backbone; global average pooling yields one feature vector per image
        self.image_encoder = tf.keras.applications.ResNet101(include_top=False,
                                                             weights='imagenet',
                                                             pooling='avg')
        self.image_encoder.trainable = False
        self.image_embedding = layers.Dense(256)
        self.word_embedding = layers.Embedding(vocab_size, 256)
        self.decoder = layers.LSTM(256, return_sequences=True)
        self.fc = layers.Dense(vocab_size)

    def call(self, image, caption):
        # (batch, 2048) image features projected to the decoder dimension
        image_emb = self.image_embedding(self.image_encoder(image))
        # Condition the LSTM on the image by using the image embedding as its initial state
        caption_emb = self.word_embedding(caption)
        decoder_output = self.decoder(caption_emb, initial_state=[image_emb, image_emb])
        # (batch, T, vocab_size) logits, one distribution per position
        return self.fc(decoder_output)

model = ImageCaptioningModel()

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')

def loss_function(real, pred):
    # Mask out padded positions (token id 0) before averaging the loss
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)
    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask
    return tf.reduce_sum(loss_) / tf.reduce_sum(mask)

optimizer = tf.keras.optimizers.Adam()

@tf.function
def train_step(image, caption):
    with tf.GradientTape() as tape:
        # Teacher forcing: predict token t+1 from tokens up to t
        predictions = model(image, caption)
        loss = loss_function(caption[:, 1:], predictions[:, :-1])
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return loss

EPOCHS = 100
for epoch in range(EPOCHS):
    total_loss = 0
    for batch, (image, caption) in enumerate(train_dataset):
        batch_loss = train_step(image, caption)
        total_loss += batch_loss
    print(f'Epoch {epoch + 1}: Loss {total_loss / len(train_dataset)}')
# Inference
def generate_caption(image_path):
    # Load and preprocess the image exactly as during training
    image = tf.io.read_file(image_path)
    image = tf.image.decode_jpeg(image, channels=3)
    image = tf.image.resize(image, (224, 224))
    image = tf.keras.applications.resnet.preprocess_input(image)
    image = tf.expand_dims(image, axis=0)

    # Start with the [start] token and generate greedily until [end] or max_length
    caption = tf.constant([[word2idx['[start]']]], dtype=tf.int64)
    result = []
    for _ in range(max_length):
        predictions = model(image, caption)
        predictions = predictions[:, -1, :]  # distribution for the next token
        predicted_id = int(tf.argmax(predictions, axis=1).numpy()[0])
        if idx2word[predicted_id] == '[end]':
            break
        result.append(idx2word[predicted_id])
        caption = tf.concat([caption, tf.constant([[predicted_id]], dtype=tf.int64)], axis=1)
    return ' '.join(result)
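After training, the helper above can be tried on any image on disk; the path below is only a placeholder, not a file shipped with the repository.

# Placeholder path; point this at any JPEG placed under cocoapi/images/
print(generate_caption('cocoapi/images/val2017/example.jpg'))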
Documentation
Dataset Used
About MS COCO dataset
The Microsoft Common Objects in COntext (MS COCO) dataset is a large-scale dataset for scene understanding. It is commonly used to train and benchmark object detection, segmentation, and captioning algorithms.

You can read more about the dataset on the website, research paper, or Appendix section at the end of this page.
Models and Technologies Used
The following methods and techniques are employed in this project:
- Vision Transformers (ViTs)
- Attention mechanisms
- Language modeling
- Transfer learning
- Evaluation metrics for image captioning (e.g., BLEU, METEOR, CIDEr); a small BLEU example follows the library list below
The project is implemented in Python and utilizes the following libraries:
- PyTorch
- Transformers
- TorchVision
- NumPy
- NLTK
- Matplotlib
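As a small illustration of how caption quality can be scored with the metrics listed above, the sketch below computes sentence-level BLEU with NLTK; the reference and candidate captions are made-up examples, not outputs of this project.

# Illustrative example: sentence-level BLEU with NLTK on made-up captions
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    'a man riding a wave on top of a surfboard'.split(),
    'a surfer rides a large wave in the ocean'.split(),
]
candidate = 'a man riding a wave on a surfboard'.split()

# Smoothing helps short captions that miss some higher-order n-grams
bleu4 = sentence_bleu(references, candidate,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method1)
print(f'BLEU-4: {bleu4:.3f}')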
Introduction
This project uses a transformer-based [3] model to generate a description for images, a task known as image captioning. Researchers have approached this problem with many methodologies. One of them is the encoder-decoder neural network [4]: the encoder transforms the source image into a representation space, and the decoder then translates the information from the encoded space into natural language. The goal of the encoder-decoder is to minimize the loss of generating a description from an image.
As shown in the survey by MD Zakir Hossain et al. [4], models that use the encoder-decoder architecture mainly consist of a language model based on LSTM [5], which decodes the encoded image received from a CNN; see Figure 1. The limitations of LSTMs on long sequences, together with the success of transformers in machine translation and other NLP tasks, drew attention to using them in machine vision. Alexey Dosovitskiy et al. introduced an image classification model (ViT) based on a classical transformer encoder that shows good performance [6]. Based on ViT, Wei Liu et al. presented an image captioning model (CPTR) using an encoder-decoder transformer [1]. The source image is fed to the transformer encoder as a sequence of patches; hence, one can treat the image captioning problem as a machine translation task.

Figure 1: Encoder Decoder Architecture
Framework
The CPTR [1] consists of an image patcher that converts images $x \in \mathbb{R}^{H \times W \times C}$ to a sequence of patches $x_p \in \mathbb{R}^{N \times E}$, where $N = \frac{HW}{P^2}$ is the number of patches; H, W, and C are the image height, width, and number of channels (C = 3), respectively; P is the patch resolution; and E is the image embedding size. Position embeddings are then added to the image patches, which form the input to twelve layers of identical transformer encoders. The output of the last encoder layer goes to four layers of identical transformer decoders. The decoder also takes the words with sinusoid positional embedding.
The pre-trained ViT weights initialize the CPTR encoder [1]. I omitted this initialization and the image positional embeddings, and added an image embedding module to the image patcher that uses the feature map extracted from the ResNet101 network [7]. The number of encoder layers is reduced to two. For ResNet101, I deleted the last two layers and the final softmax layer used for image classification.
Another modification takes place on the encoder side. The feedforward network consists of two convolution layers with a ReLU activation function in between. The encoder deals solely with the image, where it is beneficial to exploit the relative positions of the features we have. Refer to Figure 2 for the model architecture.

Figure 2: Model Architecture
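To make the image-to-sequence step concrete, here is a minimal sketch (written in the same TensorFlow style as the usage examples above, not the project's PyTorch code) of how a ResNet101 feature map can be flattened into a sequence of embeddings for a transformer encoder; the embedding size of 512 and the input resolution are assumptions for illustration only.

import tensorflow as tf

E = 512  # assumed image embedding size

# ResNet101 without its classification head; the final feature map plays the role of the patches
backbone = tf.keras.applications.ResNet101(include_top=False, weights='imagenet')
project = tf.keras.layers.Dense(E)

def image_to_sequence(images):
    # images: (batch, 224, 224, 3) -> sequence of embeddings (batch, 49, E)
    feats = backbone(images)                            # (batch, 7, 7, 2048)
    batch = tf.shape(feats)[0]
    seq = tf.reshape(feats, (batch, -1, feats.shape[-1]))  # flatten the 7x7 spatial grid
    return project(seq)

dummy = tf.random.uniform((2, 224, 224, 3))
print(image_to_sequence(dummy).shape)                   # (2, 49, 512)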
Training
The transformer decoder output goes to one fully connected layer, which provides, given the previous tokens, a probability distribution $p\left(y_c \mid y_{1:c-1}\right) \in \mathbb{R}^k$ (where k is the vocabulary size) for each token in the sequence.
I trained the model using cross-entropy loss given the target ground truth $y_{1:T}$, where T is the length of the sequence. I also add the doubly stochastic attention regularization [8] to the cross-entropy loss to penalize high weights in the encoder-decoder attention. This term encourages the sum of attention weights across the sequence to be approximately equal to one. By doing so, the model does not concentrate on specific parts of the image when generating a caption; instead, it looks all over the image, leading to richer and more descriptive text [8].
The loss function is defined as:
$$\mathcal{L} = -\sum_{c=1}^{T}\log\left(p\left(y_c \mid y_{1:c-1}\right)\right) + \sum_{l=1}^{L}\frac{1}{L}\left(\sum_{d=1}^{D}\sum_{i=1}^{P^2}\left(1-\sum_{c=1}^{T}\alpha_{cidl}\right)^2\right)$$
where D is the number of heads and L is the number of layers.
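For reference, here is a minimal sketch of this loss in the same TensorFlow style as the usage examples; the tensor layout and function name are assumptions, not the project's actual implementation.

import tensorflow as tf

def captioning_loss(targets, logits, attn):
    # targets: (batch, T) token ids, logits: (batch, T, vocab), and
    # attn: (batch, L, D, T, P^2) encoder-decoder attention weights (assumed layout)
    ce = tf.keras.losses.sparse_categorical_crossentropy(targets, logits, from_logits=True)
    mask = tf.cast(tf.not_equal(targets, 0), ce.dtype)   # ignore padded positions
    ce = tf.reduce_sum(ce * mask) / tf.reduce_sum(mask)

    # Doubly stochastic attention term: for every layer l, head d, and image position i,
    # push the attention summed over the caption tokens towards one
    attn_over_tokens = tf.reduce_sum(attn, axis=3)        # (batch, L, D, P^2)
    reg = tf.reduce_mean(tf.reduce_sum((1.0 - attn_over_tokens) ** 2, axis=[2, 3]), axis=1)
    return ce + tf.reduce_mean(reg)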
I used the Adam optimizer with a batch size of thirty-two. The model sizes can be found in the configuration file code/config.json. The evaluation metrics used are BLEU [9], METEOR [10], and GLEU [11].
I trained the model for one hundred epochs, stopping early if the tracked evaluation metric (BLEU-4) does not improve for twenty successive epochs. The learning rate is reduced by a factor of 0.25 if BLEU-4 does not improve for ten consecutive epochs. The model is evaluated against the validation split every two epochs.
The pre-trained GloVe embeddings [12] initialize the word embedding weights. The word embeddings are frozen for the first ten epochs. The ResNet101 network is fine-tuned from the beginning.
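The schedule above can be summarized in a short sketch; train_one_epoch and evaluate_bleu4 are assumed helper functions, and the reduction factor mirrors the description above.

best_bleu4, epochs_since_improvement = 0.0, 0
for epoch in range(100):
    train_one_epoch()                          # assumed helper: one pass over the training split
    if (epoch + 1) % 2 == 0:                   # validation every two epochs
        bleu4 = evaluate_bleu4()               # assumed helper: BLEU-4 on the validation split
        if bleu4 > best_bleu4:
            best_bleu4, epochs_since_improvement = bleu4, 0
        else:
            epochs_since_improvement += 2
        if epochs_since_improvement >= 20:     # stopping criterion
            break
        if epochs_since_improvement >= 10:     # reduce the learning rate (factor assumed)
            optimizer.learning_rate.assign(optimizer.learning_rate * 0.25)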
Inference
A beam search of size five is used to generate captions for the images in the test split. Generation starts by feeding the model the image and the "start of sentence" special token. At each iteration, the five tokens with the highest scores are kept. Generation stops when the "end of sentence" token is produced or the maximum length limit is reached.
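A minimal beam-search sketch is shown below; it assumes a next_token_log_probs(tokens) helper that returns log-probabilities over the vocabulary for the next token, and it is illustrative rather than the project's actual implementation.

import numpy as np

def beam_search(next_token_log_probs, start_id, end_id, beam_size=5, max_len=40):
    beams = [([start_id], 0.0)]                       # (token sequence, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            log_probs = next_token_log_probs(tokens)  # shape: (vocab_size,)
            for tok in np.argsort(log_probs)[-beam_size:]:
                candidates.append((tokens + [int(tok)], score + float(log_probs[tok])))
        # Keep only the beam_size highest-scoring candidates
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            (finished if tokens[-1] == end_id else beams).append((tokens, score))
        if not beams:                                 # every beam has produced the end token
            break
    best = max(finished + beams, key=lambda c: c[1])
    return best[0]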
Technical Details
Steps for Code Explanation
1. Data Loading and Preprocessing
- Load Annotations: The code first loads image-caption pairs from the COCO 2017 dataset, using the JSON file that contains the images and their corresponding captions (captions_train2017.json).
- Pairing Images and Captions: The code then creates a list (img_cap_pairs) that pairs image filenames with their respective captions.
- Dataframe for Captions: It organizes the data in a pandas DataFrame for easier manipulation, including creating a path to each image file.
- Sampling Data: 70,000 image-caption pairs are randomly sampled, making the dataset manageable without needing all of the data.
2. Text Preprocessing
The code preprocesses captions to prepare them for the model. It lowercases the text, removes punctuation, replaces multiple spaces with single spaces, and adds [start] and [end] tokens marking the beginning and end of each caption.
3. Tokenization
- Vocabulary Setup: A tokenizer (TextVectorization) is created with a vocabulary size of 15,000 words and a maximum sequence length of 40 tokens. It tokenizes captions, transforming them into sequences of integers.
- Saving Vocabulary: The vocabulary is saved to a file so that it can be reused later without re-adapting the tokenizer (a sketch follows this list).
- Mapping Words to Indexes: word2idx and idx2word are mappings that convert words to indices and vice versa.
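The vocabulary-saving step can look like the sketch below; the file name is an assumption, not taken from the project.

import json

# Save the adapted vocabulary once
with open('vocabulary.json', 'w') as f:
    json.dump(tokenizer.get_vocabulary(), f)

# Later: rebuild an equivalent tokenizer without re-adapting it. Depending on the
# TensorFlow version, set_vocabulary may expect the list without the '' padding and
# '[UNK]' entries that get_vocabulary() prepends.
with open('vocabulary.json', 'r') as f:
    vocab = json.load(f)
restored_tokenizer = TextVectorization(max_tokens=vocab_size,
                                       output_sequence_length=max_length,
                                       standardize=None)
restored_tokenizer.set_vocabulary(vocab)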
4. Dataset Preparation
- Image-Caption Mapping: Using a dictionary, each image is mapped to its list of captions. The images are then shuffled, and a train-validation split is made (80% for training, 20% for validation).
- Creating TensorFlow Datasets: Using the load_data function, images are read, resized, and preprocessed, and captions are tokenized into tensors. These tensors are batched for training and validation.
License
This project is licensed under the Apache 2.0 license.