Abstract

Existing attention models for image captioning typically extract only word-level attention information: the attention mechanism draws local information from the image to generate the current word, but lacks accurate global image guidance. In this paper, we first propose an image captioning approach based on self-attention, in which sentence-level attention information is extracted from the image through a self-attention mechanism to represent the global image information needed to generate sentences. We then propose a double attention model that combines the sentence-level attention model with the word-level attention model to generate more accurate captions. We apply supervision and optimization at the intermediate stage of the model to mitigate information interference. In addition, we perform two-stage training with reinforcement learning to optimize the model's evaluation metrics. Finally, we evaluate our model on three standard datasets: Flickr8k, Flickr30k, and MSCOCO. Experimental results show that our double attention model generates more accurate and richer captions, and outperforms many state-of-the-art image captioning approaches on various evaluation metrics.
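
The abstract describes the mechanism only at a high level. Below is a minimal PyTorch sketch of the double-attention idea under our own assumptions: the class name `DoubleAttention`, the use of `nn.MultiheadAttention` with mean-pooling for the sentence-level branch, additive (Bahdanau-style) scoring for the word-level branch, and concatenation as the fusion step are all illustrative choices, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DoubleAttention(nn.Module):
    """Illustrative sketch: combine a sentence-level (self-attention) context
    with a word-level (decoder-state-conditioned) context over image regions."""

    def __init__(self, region_dim, hidden_dim):
        super().__init__()
        # Sentence-level branch: self-attention over region features,
        # intended to yield a global image representation.
        self.self_attn = nn.MultiheadAttention(region_dim, num_heads=1,
                                               batch_first=True)
        # Word-level branch: additive attention scoring each region
        # against the current decoder hidden state.
        self.w_region = nn.Linear(region_dim, hidden_dim)
        self.w_hidden = nn.Linear(hidden_dim, hidden_dim)
        self.w_score = nn.Linear(hidden_dim, 1)

    def forward(self, regions, hidden):
        # regions: (batch, num_regions, region_dim) image region features
        # hidden:  (batch, hidden_dim) current decoder state
        # Sentence-level context: self-attend, then mean-pool (an assumption).
        attended, _ = self.self_attn(regions, regions, regions)
        sentence_ctx = attended.mean(dim=1)               # (batch, region_dim)

        # Word-level context: additive attention conditioned on the decoder.
        scores = self.w_score(torch.tanh(
            self.w_region(regions) + self.w_hidden(hidden).unsqueeze(1)))
        alpha = F.softmax(scores, dim=1)                  # (batch, num_regions, 1)
        word_ctx = (alpha * regions).sum(dim=1)           # (batch, region_dim)

        # Fuse the two contexts to guide the next-word prediction;
        # concatenation here is one plausible choice among several.
        return torch.cat([sentence_ctx, word_ctx], dim=-1)
```

In use, the fused context would be fed to the caption decoder at each time step, so that word generation is guided by both the local regions relevant to the current word and a global summary of the whole image.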
