Abstract
Image captioning is the task of generating a natural-language description of an image, and it combines Natural Language Processing and Computer Vision. Many caption-generation methods have been proposed in existing research, but the generated descriptions are often only weakly related to the image, and most methods rely solely on the image's global information. To address this problem, this paper describes three image captioning models built on the pretrained ResNet, GloVe, and VGG16 models. These models combine a Convolutional Neural Network (CNN) with a deep Recurrent Neural Network (RNN) based on Long Short-Term Memory (LSTM). The CNN, which consists of one or more convolutional layers, is used for image classification and feature extraction, so that only the features of meaningful regions of the image are extracted. For caption generation, LSTM is used instead of a plain RNN because its gated memory cells mitigate the vanishing-gradient problem, yielding more accurate caption predictions. The dataset used in our project is the Flickr8k dataset, which contains 8k images in total, of which 6k are used for training and 2k for testing. The proposed methodology is demonstrated on the Flickr8k dataset using Python.
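To make the encoder-decoder pipeline summarized above concrete, the following is a minimal sketch of a CNN-encoder/LSTM-decoder ("merge") captioning model in Keras, using pretrained VGG16 features as described. The layer sizes, vocabulary size, and maximum caption length are illustrative assumptions, not values reported in the paper.

```python
# A minimal sketch of a VGG16 + LSTM captioning model, assuming a
# precomputed vocabulary and tokenized captions. Hyperparameters here
# (256 units, vocab_size, max_length) are illustrative assumptions.
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

vocab_size = 8000   # assumed vocabulary size
max_length = 34     # assumed maximum caption length (in tokens)

# Feature extractor: pretrained VGG16 with the classifier head removed,
# so each image maps to a 4096-d fc2 feature vector.
base = VGG16(weights='imagenet')
extractor = Model(inputs=base.input, outputs=base.layers[-2].output)

# Encoder branch: project the 4096-d image feature down to 256-d.
image_input = Input(shape=(4096,))
img_features = Dropout(0.5)(image_input)
img_features = Dense(256, activation='relu')(img_features)

# Decoder branch: embed the partial caption and run it through an LSTM.
caption_input = Input(shape=(max_length,))
seq = Embedding(vocab_size, 256, mask_zero=True)(caption_input)
seq = Dropout(0.5)(seq)
seq = LSTM(256)(seq)

# Merge the image and text representations and predict the next word.
merged = add([img_features, seq])
merged = Dense(256, activation='relu')(merged)
output = Dense(vocab_size, activation='softmax')(merged)

model = Model(inputs=[image_input, caption_input], outputs=output)
model.compile(loss='categorical_crossentropy', optimizer='adam')
```

At inference time, a caption is generated word by word: the model is fed the image feature and the caption so far, the highest-probability next word is appended, and the loop repeats until an end-of-sequence token or max_length is reached.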