Abstract

An image caption generator produces a caption for a given image by understanding its content. The task draws on numerous concepts from computer vision to recognize what is in the image and to express it in English. The challenging part of caption generation is to understand the image and its context and then produce an English description of it. In our work, we compare the abilities of two deep learning architectures, VGG16 and ResNet50, for understanding the image, with an LSTM used to generate the relevant caption. The paper discusses the use of these two architectures for generating captions from photographs. The high-dimensional Flickr8k dataset is used to compare the quality of the generated captions. It contains 8,000 images, each paired with five different captions that describe the content of the image. The high computational power of deep learning techniques makes it possible to build models that can generate captions for pictures. The performance of the two architectures is compared using the BLEU score. A widely used application of image caption generation is describing photographs so that blind users can understand the image.

Keywords: Image caption generator, VGG16, ResNet50, Flickr8k dataset, Deep learning
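As a rough illustration of the encoder-decoder approach described above (a sketch, not the authors' exact implementation), the code below extracts image features with a pretrained VGG16 or ResNet50 and combines them with the partial caption in an LSTM decoder that predicts the next word. The Keras API is assumed, and the vocabulary size, maximum caption length, and layer widths are illustrative placeholders.

from tensorflow.keras.applications import VGG16, ResNet50
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

VOCAB_SIZE = 8000   # assumed vocabulary size built from the Flickr8k captions
MAX_LEN = 34        # assumed maximum caption length in tokens
FEATURE_DIM = 4096  # VGG16 fc2 features; ResNet50 pooled features are 2048-d

def build_feature_extractor(name="vgg16"):
    # Both encoders are used only as fixed feature extractors (ImageNet weights).
    if name == "vgg16":
        base = VGG16(weights="imagenet")
        return Model(base.input, base.layers[-2].output)  # 4096-d fc2 output
    return ResNet50(weights="imagenet", include_top=False, pooling="avg")  # 2048-d

def build_caption_model(feature_dim=FEATURE_DIM):
    # Image branch: compress the CNN feature vector.
    img_in = Input(shape=(feature_dim,))
    img = Dropout(0.5)(img_in)
    img = Dense(256, activation="relu")(img)

    # Text branch: embed the partial caption and run it through an LSTM.
    seq_in = Input(shape=(MAX_LEN,))
    seq = Embedding(VOCAB_SIZE, 256, mask_zero=True)(seq_in)
    seq = Dropout(0.5)(seq)
    seq = LSTM(256)(seq)

    # Merge both branches and predict the next word of the caption.
    merged = add([img, seq])
    merged = Dense(256, activation="relu")(merged)
    out = Dense(VOCAB_SIZE, activation="softmax")(merged)

    model = Model(inputs=[img_in, seq_in], outputs=out)
    model.compile(loss="categorical_crossentropy", optimizer="adam")
    return model

The BLEU comparison mentioned above can be computed with NLTK's corpus_bleu; the tokenized captions below are hypothetical placeholders (Flickr8k provides five reference captions per image).

from nltk.translate.bleu_score import corpus_bleu

references = [[["a", "dog", "runs", "on", "the", "grass"]]]  # reference captions per image
candidates = [["a", "dog", "is", "running", "on", "grass"]]  # generated captions
print("BLEU-1:", corpus_bleu(references, candidates, weights=(1.0, 0, 0, 0)))
print("BLEU-2:", corpus_bleu(references, candidates, weights=(0.5, 0.5, 0, 0)))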
