Abstract

Image captioning is a technology that enables machines to understand the content of images and generate descriptive text. With the development of deep learning, applying it to this task has become an active research topic. This paper proposes a multilayer dense attention model for image captioning. A Faster Region-based Convolutional Neural Network (Faster R-CNN) is employed as the encoding layer to extract image features, a long short-term memory (LSTM) network with attention is used as the decoder of the multilayer dense attention model, and the description text is generated. The model parameters are optimized using policy gradient optimization from reinforcement learning. The dense attention mechanism in the encoding layer effectively avoids interference from non-salient information and allows the decoding process to selectively output the corresponding description text. Experimental results on general-domain images validate the model's ability to understand images and generate text.
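The abstract outlines an encoder-decoder pipeline (Faster R-CNN region features feeding an attention-equipped LSTM decoder) trained with a policy gradient. Below is a minimal PyTorch sketch of one decoding step with additive attention over region features, plus a REINFORCE-style loss. All layer sizes, names, and the additive-attention form are illustrative assumptions, not the paper's exact multilayer dense attention architecture.

```python
# Minimal sketch of attention-based caption decoding and a policy-gradient
# loss. Layer sizes and the attention form are assumptions for illustration.
import torch
import torch.nn as nn

class AttentionDecoder(nn.Module):
    def __init__(self, feat_dim=2048, embed_dim=512, hidden_dim=512,
                 vocab_size=10000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Additive attention over detector region features
        self.att_feat = nn.Linear(feat_dim, hidden_dim)
        self.att_hid = nn.Linear(hidden_dim, hidden_dim)
        self.att_score = nn.Linear(hidden_dim, 1)
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def step(self, word_ids, feats, h, c):
        # feats: (batch, num_regions, feat_dim) region features, e.g. from
        # a Faster R-CNN encoder; h, c: previous LSTM state
        scores = self.att_score(torch.tanh(
            self.att_feat(feats) + self.att_hid(h).unsqueeze(1)))  # (B, R, 1)
        alpha = torch.softmax(scores, dim=1)
        # Weighted sum focuses on salient regions, suppressing the rest
        context = (alpha * feats).sum(dim=1)                        # (B, feat_dim)
        h, c = self.lstm(
            torch.cat([self.embed(word_ids), context], dim=1), (h, c))
        return self.out(h), h, c                                    # word logits

def policy_gradient_loss(log_probs, reward, baseline):
    # REINFORCE-style objective: log-probs of a sampled caption, shape (B, T),
    # weighted by (reward - baseline), e.g. a CIDEr score per caption
    return -((reward - baseline) * log_probs.sum(dim=1)).mean()
```

In self-critical sequence training, one common instantiation of this objective, the baseline is the evaluation score of the greedy-decoded caption, so gradients favor sampled captions that outperform the model's own test-time output.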
