Abstract

As a crossing domain of computer vision and natural language processing, image caption generation has been an active research topic in recent years, as it supports multimodal social media translation from unstructured image data to structured text data. Prior work has proposed a series of image captioning methods, including template-based, retrieval-based, and encoder-decoder approaches. Among these, the encoder-decoder framework is the most widely used in image caption generation: the encoder extracts image features with a Convolutional Neural Network (CNN), and the decoder generates the image description with a Recurrent Neural Network (RNN). The Neural Image Caption (NIC) model has achieved good performance in image captioning; however, several challenges remain to be addressed. To tackle the lack of image information and the deviation from the core content of the image, our proposed model exploits visual attention to deepen the understanding of the image, and it incorporates image labels generated by a Fully Convolutional Network (FCN) into the generation of the image caption. Furthermore, the proposed model employs textual attention to improve the completeness of the generated information. Finally, label generation, coupled with the textual attention mechanism, and image caption generation are merged into an end-to-end trainable framework. Extensive experiments have been carried out on the AIC-ICC image caption benchmark dataset, and the results show that our proposed model is effective and feasible for image caption generation.
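The core of the visual-attention step described above can be sketched as a generic soft-attention computation: each image region feature is scored against the decoder's hidden state, the scores are normalized with a softmax, and the context vector is the attention-weighted sum of the region features. This is a minimal illustrative sketch using dot-product scoring, not the paper's exact formulation (the paper's attention parameterization and the FCN label pathway are not specified in the abstract).

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of raw attention scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    # Plain dot product between two equal-length vectors.
    return sum(x * y for x, y in zip(a, b))

def attend(regions, hidden):
    """Soft attention over image region features.

    regions: list of region feature vectors (e.g. CNN feature-map columns)
    hidden:  the decoder RNN's current hidden state
    Returns the context vector: the attention-weighted sum of regions.
    """
    scores = [dot(r, hidden) for r in regions]   # score each region
    weights = softmax(scores)                    # normalize to a distribution
    dim = len(regions[0])
    # Weighted sum of region features along each dimension.
    return [sum(w * r[i] for w, r in zip(weights, regions)) for i in range(dim)]
```

At each decoding step, the decoder would consume this context vector alongside the previously generated word; the textual attention mentioned in the abstract applies the same weighting idea to label/word features instead of image regions.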
