Abstract
Image captioning is the task of generating a natural-language description of an image, and it combines Natural Language Processing and Computer Vision. Many caption-generation methods have been proposed in existing research, but the generated descriptions are often only weakly related to the image, and most methods rely solely on the image's global information. To address this problem, this paper describes three image captioning models built on the pretrained ResNet, GloVe, and VGG16 models. These models combine a Convolutional Neural Network (CNN) with a deep Recurrent Neural Network (RNN) based on Long Short-Term Memory (LSTM). The CNN, which consists of one or more convolutional layers, is used for image classification and feature extraction, so that only the features of meaningful regions of the image are extracted. For caption generation, LSTM is used instead of a plain RNN because its gated memory cells mitigate the vanishing-gradient problem, yielding more accurate caption predictions. The dataset used in our project is the Flickr8k dataset, which contains 8k images in total, of which 6k are used for training and 2k for testing. The proposed methodology is demonstrated on the Flickr8k dataset using Python.
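To make the encoder-decoder pipeline summarized above concrete, the following is a minimal sketch of a CNN-encoder/LSTM-decoder ("merge") captioning model in Keras, using pretrained VGG16 features as described. The layer sizes, vocabulary size, and maximum caption length are illustrative assumptions, not values reported in the paper.

```python
# A minimal sketch of a VGG16 + LSTM captioning model, assuming a
# precomputed vocabulary and tokenized captions. Hyperparameters here
# (256 units, vocab_size, max_length) are illustrative assumptions.
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

vocab_size = 8000   # assumed vocabulary size
max_length = 34     # assumed maximum caption length (in tokens)

# Feature extractor: pretrained VGG16 with the classifier head removed,
# so each image maps to a 4096-d fc2 feature vector.
base = VGG16(weights='imagenet')
extractor = Model(inputs=base.input, outputs=base.layers[-2].output)

# Encoder branch: project the 4096-d image feature down to 256-d.
image_input = Input(shape=(4096,))
img_features = Dropout(0.5)(image_input)
img_features = Dense(256, activation='relu')(img_features)

# Decoder branch: embed the partial caption and run it through an LSTM.
caption_input = Input(shape=(max_length,))
seq = Embedding(vocab_size, 256, mask_zero=True)(caption_input)
seq = Dropout(0.5)(seq)
seq = LSTM(256)(seq)

# Merge the image and text representations and predict the next word.
merged = add([img_features, seq])
merged = Dense(256, activation='relu')(merged)
output = Dense(vocab_size, activation='softmax')(merged)

model = Model(inputs=[image_input, caption_input], outputs=output)
model.compile(loss='categorical_crossentropy', optimizer='adam')
```

At inference time, a caption is generated word by word: the model is fed the image feature and the caption so far, the highest-probability next word is appended, and the loop repeats until an end-of-sequence token or max_length is reached.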