Purpose:
The purpose of this part is to search for images similar to a given text query. We use a model that generates a caption for each image, store the captions for a set of images, and then search the input query against those captions to produce the results.



Preprocessing of data
The Flickr dataset contains many images, but for my search-engine implementation I randomly selected 1,000 of them. These images are uploaded to my GitHub repository.
STEPS:
- A directory on GitHub can only display its first 1,000 entries, which is why the dataset was capped at 1,000 images.
- The data is fed to the Inception V3 model for testing.
Algorithm:
TRAINING INCEPTION V3 MODEL:
The model performs captioning with attention and is an encoder-decoder model.
- It uses the MS-COCO dataset, which has more than 82,000 images and 400,000 captions.
- We use a subset of 30,000 images.
- The input to the model is images resized to 299px x 299px, normalized so that the pixel values lie in the range -1 to 1.
- We classify using a model pretrained on ImageNet.
- The last layer used from the model is a convolutional layer, since we are using an attention mechanism.
- The output of this last layer has shape 8x8x2048.
- The features extracted from the model are stored in a .npy file for each image.
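The pixel normalization above can be sketched in a few lines. This is a minimal numpy version (in TensorFlow it corresponds to tf.keras.applications.inception_v3.preprocess_input), with a random array standing in for a real image:

```python
import numpy as np

def preprocess_pixels(image):
    """Scale uint8 pixels from [0, 255] to the [-1, 1] range Inception V3 expects."""
    return image.astype(np.float32) / 127.5 - 1.0

# A dummy 299x299 RGB image stands in for a real Flickr image.
img = np.random.randint(0, 256, size=(299, 299, 3), dtype=np.uint8)
x = preprocess_pixels(img)
print(x.shape, x.min() >= -1.0, x.max() <= 1.0)
```

The real pipeline then feeds such a tensor through the pretrained Inception V3 network and saves the 8x8x2048 output of the last convolutional layer to a .npy file per image.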
TOKENIZING THE CAPTIONS:
The following steps are done to get the unique terms:
- We split the captions on whitespace.
- The top 5,000 words are kept in the vocabulary; all remaining words are stored as "UNK".
- A word-to-index mapping is built.
- We split the data into a train set and a test set.
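The tokenization steps above can be sketched in plain Python (the actual implementation may use Keras's tokenizer utilities; the pad/UNK index choices here are assumptions for illustration):

```python
from collections import Counter

VOCAB_SIZE = 5000  # keep only the most frequent words

def build_vocab(captions, vocab_size=VOCAB_SIZE):
    """Split on whitespace, keep the top `vocab_size` words, map the rest to UNK."""
    counts = Counter(word for cap in captions for word in cap.lower().split())
    vocab = [w for w, _ in counts.most_common(vocab_size)]
    word2idx = {w: i + 2 for i, w in enumerate(vocab)}  # 0 = <pad>, 1 = UNK (assumed)
    word2idx['UNK'] = 1
    return word2idx

def encode(caption, word2idx):
    """Map each word to its index; unseen words fall back to UNK."""
    return [word2idx.get(w, word2idx['UNK']) for w in caption.lower().split()]

caps = ["a dog runs on the grass", "a man rides a horse"]
w2i = build_vocab(caps)
print(encode("a dog rides a bike", w2i))  # "bike" is unseen -> UNK index
```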
MODEL:
- The output of the feature extractor is a vector of shape 8x8x2048, which is reshaped to 64x2048.
- This vector is passed to the CNN encoder (which consists of a single fully connected layer).
- The RNN decoder then attends over the image to predict the next word.
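A minimal numpy sketch of the reshape and the fully connected encoder. The 256-dimensional output size and the random weight matrix are illustrative stand-ins, not the trained values:

```python
import numpy as np

EMBED_DIM = 256  # assumed encoder output size; the report does not state it

features = np.random.randn(8, 8, 2048).astype(np.float32)   # Inception V3 output
flat = features.reshape(64, 2048)                           # 8x8 grid -> 64 locations

# The "CNN encoder" is just one fully connected layer applied per location;
# W is a randomly initialized stand-in for its learned weights.
W = np.random.randn(2048, EMBED_DIM).astype(np.float32) * 0.01
encoded = np.maximum(flat @ W, 0.0)                         # ReLU(flat W)
print(encoded.shape)  # (64, 256): one embedding per image location for attention
```

The decoder then computes attention weights over these 64 location embeddings at every time step.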
TRAINING:
- We use our training data for this.
- The extracted features are loaded into the encoder from each image's .npy file.
- The encoder's output is passed to the decoder, which produces its hidden state and the prediction.
- The loss is calculated from the decoder's predictions and backpropagated.
- The last step is to calculate the gradients and apply the optimizer.
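The training loop above can be sketched as a toy TensorFlow step. The tiny dimensions are made up, and a mean-pooled GRU stands in for the attention decoder to keep the sketch short; the real model uses 64x2048 features, a 5,000-word vocabulary, and larger embedding/hidden sizes:

```python
import numpy as np
import tensorflow as tf

# Tiny made-up dimensions for illustration only.
VOCAB, EMBED, UNITS, FEAT = 50, 16, 32, 64

encoder = tf.keras.layers.Dense(UNITS, activation="relu")   # the fully connected "CNN encoder"
embedding = tf.keras.layers.Embedding(VOCAB, EMBED)
gru = tf.keras.layers.GRU(UNITS, return_state=True)
out_layer = tf.keras.layers.Dense(VOCAB)                    # logits over the vocabulary
optimizer = tf.keras.optimizers.Adam()
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

def train_step(img_features, target):
    """One teacher-forcing step: encode features, predict each next word from the
    previous ground-truth word, sum the loss, then backpropagate and optimize."""
    loss = 0.0
    with tf.GradientTape() as tape:
        enc_out = encoder(img_features)            # (batch, 64, UNITS)
        state = tf.reduce_mean(enc_out, axis=1)    # mean-pooled context (stand-in for attention)
        for t in range(1, target.shape[1]):
            x = embedding(target[:, t - 1:t])      # previous ground-truth word
            _, state = gru(x, initial_state=[state])
            loss += loss_fn(target[:, t], out_layer(state))
    variables = (encoder.trainable_variables + embedding.trainable_variables
                 + gru.trainable_variables + out_layer.trainable_variables)
    optimizer.apply_gradients(zip(tape.gradient(loss, variables), variables))
    return loss

feats = tf.random.normal((2, 64, FEAT))
caps = tf.constant(np.random.randint(1, VOCAB, size=(2, 5)))
print(float(train_step(feats, caps)))
```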

Graph showing the loss over the epochs

Caption generated using attention
STEPS (after the Inception V3 model was trained):
- Once we have the trained model, we are ready to generate captions for our dataset of 1,000 images.
- In Colab, using the trained model, I wrote a block of code to iterate over my dataset and store each image's URL and caption in a .csv file for the next task.
- Similar to the first task, I implemented Tf-Idf search on this file: the input query is tokenized and searched against the stored captions, and the top 10 results are displayed with their images.
For iterating over the images and saving the captions to a file:

import csv
import tensorflow as tf

with open('annotations.csv', 'a') as csvFile:
    writer = csv.writer(csvFile)
    for i in range(1000, 2000):
        image_url = "https://raw.githubusercontent.com/ruchirchugh/Data-Mining/master/image_caption_data/" + str(i) + ".jpg"
        image_extension = image_url[-4:]  # ".jpg"
        # The file name must change on every iteration, otherwise
        # get_file keeps returning the first cached image.
        image_path = tf.keras.utils.get_file('new' + str(i) + image_extension,
                                             origin=image_url)
        result, attention_plot = evaluate(image_path)
        caption = ' '.join(result)
        writer.writerow([str(i), image_url, caption])
        print("Predicted Caption:", caption)
print("done")
Calculating score for each result:
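A minimal pure-Python sketch of the Tf-Idf scoring used to rank the captions against a query (the real implementation reuses my text-search code from the first task on the .csv of captions):

```python
import math
from collections import Counter

def tfidf_scores(query, captions):
    """Score each caption against the query with a simple tf-idf sum."""
    docs = [c.lower().split() for c in captions]
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))  # document frequency per term
    scores = []
    for d in docs:
        tf = Counter(d)
        score = sum(tf[w] / len(d) * math.log(n / df[w])
                    for w in query.lower().split() if w in tf)
        scores.append(score)
    return scores

captions = ["a dog runs on the grass",
            "a man rides a horse",
            "a dog plays with a ball"]
scores = tfidf_scores("dog", captions)
top = sorted(range(len(scores)), key=lambda i: -scores[i])[:10]  # top-10 caption ids
print(top)
```

The indices of the top-scoring captions map back to the image numbers stored in the .csv file, so the matching images can be displayed.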

Contributions:
- Made a dataset of 1,000 images available online so that the Inception V3 model could be tested on it.
- The trained model was tested on these images; the generated captions were stored in a .csv file along with the image URLs for the search step. I implemented the iteration over this dataset and the data storage.
- Saving the captions together with the URL and the image number was important, as these were used in the later problem. I implemented that code.
- Tuned my text-search code from the first task for this problem.
- Stored the weights of the model's encoder and decoder so that the model can be used offline later on.
Experiments:
- I tried hosting the image-captioning model on PythonAnywhere by keeping the learned weights stored from the trained model, so that a caption could be predicted for any input image. This required more space than available, since TensorFlow alone needs more than 450 MB; because of PythonAnywhere's storage limit I wasn't able to deploy it there.
- I was able to run the model offline on my system, generate a caption for any input image, and search that caption in my .csv file to get similar results.
Challenges:
- Training the Inception model multiple times consumed many hours, as each run takes about 3.5 hours.
- It had to be trained multiple times because if you do not stay connected to Colab, the session gets disconnected and all the data is lost.
- Even to save the model I had to run it with CPU-only settings, which took more time than usual.
- While generating captions on the test dataset, the model was producing the same caption for every image. A small mistake was causing this: the first parameter of tf.keras.utils.get_file (the file name) had been hard-coded, so it stayed the same for every image and the cached first image was reused. I fixed it by changing the file name on each iteration as well.
Limitations:
For some images the model generates wrong captions, and for others it generates repeated words.


References:
- https://www.tensorflow.org/guide/keras/save_and_serialize
- https://www.tensorflow.org/tutorials/text/image_captioning
- https://github.com/tensorflow/tensorflow/blob/r1.13/tensorflow/contrib/eager/python/examples/generative_examples/image_captioning_with_attention.ipynb
- https://ashaduzzaman-rubel.netlify.com/post/image-captioning-search/