Abstract

In this work, we present a thorough experimental study of feature extraction using Convolutional Neural Networks (CNNs) for the task of image captioning in the context of deep learning. We perform a set of 72 experiments on 12 image classification CNNs pre-trained on the ImageNet [29] dataset. The features are extracted from the last layer after removing the fully connected layer, and are fed into the captioning model. We use a unified captioning model with a fixed vocabulary size across all experiments to isolate the effect of changing the CNN feature extractor on image captioning quality. Scores are computed using the standard image captioning metrics. We find a strong relationship between the model structure and the image captioning dataset, and show that VGG models yield the lowest-quality features for image captioning among the tested CNNs. Finally, we recommend a set of pre-trained CNNs for each of the image captioning evaluation metrics one may wish to optimise, and relate our results to previous work. To our knowledge, this work is the most comprehensive comparison of feature extractors for image captioning.
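To make the described setup concrete, the following is a minimal sketch of the feature-extraction step the abstract outlines: an ImageNet-pre-trained classification CNN with its final fully connected layer removed, producing a feature vector that would be fed to the captioning model. It assumes PyTorch and torchvision, and uses ResNet-50 purely as an illustrative backbone; the paper's actual 12 backbones and captioning model are not specified here.

    import torch
    import torch.nn as nn
    from torchvision import models, transforms

    # Load an ImageNet-pre-trained classification CNN (ResNet-50 as one
    # illustrative backbone) and drop its final fully connected layer,
    # keeping everything up to and including the global average pooling.
    backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    feature_extractor = nn.Sequential(*list(backbone.children())[:-1])
    feature_extractor.eval()

    # Standard ImageNet preprocessing.
    preprocess = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])

    @torch.no_grad()
    def extract_features(image):
        """Return a flat feature vector for a single PIL image."""
        x = preprocess(image).unsqueeze(0)   # (1, 3, 224, 224)
        feats = feature_extractor(x)         # (1, 2048, 1, 1) for ResNet-50
        return feats.flatten(1)              # (1, 2048), input to the captioner

Swapping in a different torchvision backbone (e.g. a VGG or DenseNet variant) changes only the first two lines and the resulting feature dimensionality, which is what allows a fixed captioning model to be reused across all the compared CNNs.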
