Abstract

Image captioning helps people understand images through fine-grained, automatically generated descriptions. Recently, encoder-decoder architectures with attention mechanisms have achieved notable success in image captioning and visual question answering. In this paper, we propose a new captioning algorithm that integrates two separate LSTM (Long Short-Term Memory) networks through an adaptive semantic attention model. In our approach, the first LSTM network is followed by an attention model that serves as a visual sentinel, allowing the model to flexibly trade off between visual semantic regions and textual content. The second LSTM acts as a language model: it combines the hidden state of the first LSTM with the attention context vector and outputs the word sequence. The proposed model has been extensively evaluated on two large-scale datasets: MSCOCO and Flickr30k. Experimental results show that the proposed method attends more closely to visually salient regions and achieves significant improvements over prior state-of-the-art approaches on multiple evaluation metrics.
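
To make the described architecture concrete, below is a minimal PyTorch-style sketch of one decoding step with two LSTMs and an adaptive attention module that includes a visual sentinel. All class names, layer shapes, and the exact gating formula are assumptions for illustration; the abstract does not specify these details, so this is not the authors' exact implementation.

```python
import torch
import torch.nn as nn


class AdaptiveSemanticAttention(nn.Module):
    """Sketch: attention over K region features plus a visual sentinel,
    so each step can weight visual evidence against textual context."""

    def __init__(self, dim, att_dim):
        super().__init__()
        self.region_proj = nn.Linear(dim, att_dim)    # project region features
        self.sentinel_proj = nn.Linear(dim, att_dim)  # project sentinel vector
        self.hidden_proj = nn.Linear(dim, att_dim)    # project decoder hidden state
        self.score = nn.Linear(att_dim, 1)            # scalar attention score

    def forward(self, regions, sentinel, hidden):
        # regions: (B, K, dim), sentinel: (B, dim), hidden: (B, dim)
        h = self.hidden_proj(hidden).unsqueeze(1)                                     # (B, 1, A)
        e_regions = self.score(torch.tanh(self.region_proj(regions) + h))             # (B, K, 1)
        e_sent = self.score(torch.tanh(self.sentinel_proj(sentinel).unsqueeze(1) + h))  # (B, 1, 1)
        alpha = torch.softmax(torch.cat([e_regions, e_sent], dim=1), dim=1)           # (B, K+1, 1)
        beta = alpha[:, -1]                                  # (B, 1): weight placed on the sentinel
        visual_ctx = (alpha[:, :-1] * regions).sum(dim=1)    # (B, dim): attended visual context
        # High beta -> rely on the (textual) sentinel; low beta -> rely on visual regions.
        context = beta * sentinel + (1.0 - beta) * visual_ctx
        return context, beta


class TwoLSTMDecoderStep(nn.Module):
    """Sketch of one step: attention LSTM -> visual sentinel -> adaptive
    attention -> language LSTM -> word distribution (hypothetical layout)."""

    def __init__(self, dim, vocab_size):
        super().__init__()
        self.attn_lstm = nn.LSTMCell(2 * dim, dim)   # first LSTM: word embedding + global image feature
        self.lang_lstm = nn.LSTMCell(2 * dim, dim)   # second LSTM: context vector + first LSTM's state
        self.attention = AdaptiveSemanticAttention(dim, dim)
        self.sentinel_gate = nn.Linear(2 * dim, dim)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, word_emb, global_feat, regions, state1, state2):
        h1, c1 = self.attn_lstm(torch.cat([word_emb, global_feat], dim=1), state1)
        # Visual sentinel: a gated view of the first LSTM's memory cell.
        gate = torch.sigmoid(self.sentinel_gate(torch.cat([word_emb, h1], dim=1)))
        sentinel = gate * torch.tanh(c1)
        context, beta = self.attention(regions, sentinel, h1)
        # Language LSTM combines the first LSTM's hidden state with the context vector.
        h2, c2 = self.lang_lstm(torch.cat([context, h1], dim=1), state2)
        logits = self.out(h2)                         # (B, vocab_size) scores for the next word
        return logits, (h1, c1), (h2, c2), beta


# Example usage with random tensors (batch of 2, 36 regions, 512-d features).
if __name__ == "__main__":
    B, K, D, V = 2, 36, 512, 10000
    step = TwoLSTMDecoderStep(D, V)
    zeros = lambda: (torch.zeros(B, D), torch.zeros(B, D))
    logits, s1, s2, beta = step(torch.randn(B, D), torch.randn(B, D),
                                torch.randn(B, K, D), zeros(), zeros())
    print(logits.shape, beta.shape)  # torch.Size([2, 10000]) torch.Size([2, 1])
```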
