Abstract

Image captioning is the task of automatically describing the content of an image, connecting computer vision and natural language processing. In this paper, we compare five popular convolutional neural network architectures: VGG16, InceptionV3, ResNet50, DenseNet201, and Xception, using each as the pre-trained feature extractor for the same image captioning model. The encoder-decoder model is a recurrent neural network architecture for sequence-to-sequence prediction problems; the encoder is typically a convolutional neural network pre-trained on a large dataset. Many different encoder-decoder architectures have been used for caption generation, but evaluating their relative performance is difficult. In this paper, we train each of the VGG16, ResNet50, InceptionV3, DenseNet201, and Xception models with categorical cross-entropy as the loss function and RMSprop as the optimizer.
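The categorical cross-entropy loss mentioned above can be sketched in NumPy; this is a minimal illustration of the loss the paper says it uses, not the authors' implementation, and the two-word vocabulary and probability values are invented for the example.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-12):
    """Mean categorical cross-entropy over a batch.

    y_true: one-hot target words, shape (batch, vocab_size)
    y_pred: predicted word probabilities, shape (batch, vocab_size)
    """
    # Clip predictions to avoid log(0).
    y_pred = np.clip(y_pred, eps, 1.0)
    return float(-np.sum(y_true * np.log(y_pred)) / y_true.shape[0])

# Toy example: a two-word vocabulary and two caption positions.
y_true = np.array([[1.0, 0.0], [0.0, 1.0]])
y_pred = np.array([[0.9, 0.1], [0.2, 0.8]])
loss = categorical_crossentropy(y_true, y_pred)
# loss = -(ln 0.9 + ln 0.8) / 2 ≈ 0.164
```

In a captioning decoder this loss is averaged over every predicted word of every caption; an optimizer such as RMSprop then updates the decoder weights to reduce it.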
