Abstract

Image captioning is a challenging task that requires not only extracting semantic information from an image but also generating grammatically correct descriptions. Most previous studies employ a one- or two-layer Recurrent Neural Network (RNN) as the language model for predicting sentence words. Such a shallow language model can readily handle word information for nouns and objects, but it may fail to learn verbs or adjectives. To address this issue, a deep attention-based language model is proposed to learn more abstract word information, and three stacked approaches are designed to process attention. The proposed model makes full use of the Long Short-Term Memory (LSTM) network and employs the transferred current attention to enhance extra spatial information. Experimental results on the benchmark MSCOCO and Flickr30K datasets verify the effectiveness of the proposed model.
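To make the attention mechanism described above concrete, the sketch below shows the generic soft-attention step that such models apply before feeding a context vector into the stacked LSTM language model: spatial features are scored against a query (e.g. the LSTM hidden state), normalized with a softmax, and combined into a weighted context vector. This is an illustrative pure-Python sketch of standard soft attention, not the paper's actual implementation; the function names (`softmax`, `attend`) and the dot-product scoring are assumptions for demonstration.

```python
import math

def softmax(scores):
    # Numerically stable softmax over attention scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(features, query):
    # Score each spatial feature vector against the query (dot product),
    # then return the attention-weighted context vector and the weights.
    scores = [sum(f_i * q_i for f_i, q_i in zip(f, query)) for f in features]
    weights = softmax(scores)
    dim = len(features[0])
    context = [sum(w * f[d] for w, f in zip(weights, features))
               for d in range(dim)]
    return context, weights

# Toy example: three spatial feature vectors and one query
# (hypothetically, the previous LSTM hidden state).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]
context, weights = attend(features, query)
```

In a stacked design, the resulting `context` would be consumed by the first LSTM layer, whose output can in turn serve as the query for the next attention step.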
