Abstract

Image captioning enables people to better understand images through fine-grained analysis. Recently, the encoder-decoder architecture with an attention mechanism has achieved great success in image captioning and visual question answering. In this paper, we propose a new captioning algorithm that integrates two separate LSTM (Long Short-Term Memory) networks through an adaptive semantic attention model. In our approach, the first LSTM network is followed by an attention model that serves as a visual sentinel and can flexibly trade off between visual semantic regions and textual content. The second LSTM acts as a language model: it combines the hidden state of the first LSTM with the attention context vector and outputs the word sequence. The proposed model is extensively evaluated on two large-scale datasets, MSCOCO and Flickr30k. Experimental results show that the proposed method attends more closely to visually salient regions and outperforms prior state-of-the-art approaches on multiple evaluation metrics.
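The two-LSTM design with a visual sentinel described above can be pictured with a short PyTorch-style sketch. This is a minimal illustration under assumed names and dimensions (AdaptiveAttentionCaptioner, embed_dim, hidden_dim, the particular gate formulation), not the authors' implementation: the first LSTM drives an attention step that scores image regions together with a gated "sentinel" copy of its memory cell, and the second LSTM consumes the first LSTM's hidden state and the resulting context vector to predict the next word.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveAttentionCaptioner(nn.Module):
    """Sketch of a two-LSTM captioner with an adaptive visual-sentinel attention gate.
    All layer names and sizes are illustrative assumptions, not the paper's released code."""

    def __init__(self, embed_dim=512, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # First LSTM: attends over visual region features.
        self.attn_lstm = nn.LSTMCell(embed_dim + hidden_dim, hidden_dim)
        # Second LSTM: language model over the attended context.
        self.lang_lstm = nn.LSTMCell(hidden_dim * 2, hidden_dim)
        # Sentinel gate: decides how much to rely on visual vs. textual memory.
        self.sentinel_x = nn.Linear(embed_dim + hidden_dim, hidden_dim)
        self.sentinel_h = nn.Linear(hidden_dim, hidden_dim)
        # Attention scoring layers over regions plus the sentinel.
        self.att_v = nn.Linear(hidden_dim, hidden_dim)
        self.att_h = nn.Linear(hidden_dim, hidden_dim)
        self.att_s = nn.Linear(hidden_dim, hidden_dim)
        self.att_score = nn.Linear(hidden_dim, 1)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def step(self, word, regions, state1, state2):
        # word: (B,) token ids; regions: (B, R, hidden_dim) region features.
        x = torch.cat([self.embed(word), state2[0]], dim=1)
        h1, c1 = self.attn_lstm(x, state1)
        # Visual sentinel: a gated copy of the first LSTM's memory cell.
        gate = torch.sigmoid(self.sentinel_x(x) + self.sentinel_h(state1[0]))
        sentinel = gate * torch.tanh(c1)
        # Score regions and the sentinel jointly; the softmax trades off visual vs. textual.
        cand = torch.cat([self.att_v(regions), self.att_s(sentinel).unsqueeze(1)], dim=1)
        scores = self.att_score(torch.tanh(cand + self.att_h(h1).unsqueeze(1))).squeeze(-1)
        alpha = F.softmax(scores, dim=1)
        values = torch.cat([regions, sentinel.unsqueeze(1)], dim=1)
        context = (alpha.unsqueeze(-1) * values).sum(dim=1)
        # Language LSTM combines the first LSTM's hidden state with the context vector.
        h2, c2 = self.lang_lstm(torch.cat([h1, context], dim=1), state2)
        return self.out(h2), (h1, c1), (h2, c2)

# Toy usage with illustrative sizes: batch of 2 captions over 49 regions.
model = AdaptiveAttentionCaptioner()
regions = torch.randn(2, 49, 512)
word = torch.zeros(2, dtype=torch.long)  # <start> token
state1 = (torch.zeros(2, 512), torch.zeros(2, 512))
state2 = (torch.zeros(2, 512), torch.zeros(2, 512))
logits, state1, state2 = model.step(word, regions, state1, state2)
```

When the attention weight on the sentinel dominates, the decoder leans on the language model's memory rather than on any image region, which is how this kind of model decides when to look at the image and when to rely on textual context.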
