Abstract

Most state-of-the-art methods for image captioning depend heavily on an attention mechanism over object regions within the encoder-decoder framework. Existing attention models are generally based on simple addition or multiplication operations and may fail to fully capture the complex relationships between the visual features and the target words. In this paper, we propose a novel attention model, named graph self-attention (GSA), that incorporates graph networks and self-attention for image captioning. GSA constructs a star-graph model to dynamically assign weights to the detected object regions as the words are generated step by step. The central node is represented by the semantic feature, while the visual features of the object regions serve as edge nodes. By propagating messages between the central and edge nodes, GSA explicitly captures the relationships between the current target word and the image features. To generate conjunctions and attributives that are not directly related to visual information, GSA introduces self-attention so that such words can attend more to the semantic information. Moreover, the GSA model is generic and can be applied to tasks that require attention over multiple features. The experiments show the effectiveness and potential of the proposed GSA.
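To make the star-graph idea concrete, the following is a minimal sketch of one attention step in which a semantic central node exchanges messages with visual edge nodes, and a self-attention gate lets non-visual words fall back on the semantic information. The layer sizes, the dot-product message rule, and the sigmoid gate are illustrative assumptions, not the paper's exact GSA formulation.

```python
# Illustrative star-graph attention step (assumed shapes and update rule,
# not the exact GSA equations from the paper).
import torch
import torch.nn as nn
import torch.nn.functional as F

class StarGraphAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)        # projects the central (semantic) node
        self.k = nn.Linear(dim, dim)        # projects the edge (region) nodes
        self.v = nn.Linear(dim, dim)
        self.self_gate = nn.Linear(dim, 1)  # lets non-visual words rely on semantics

    def forward(self, semantic, regions):
        # semantic: (batch, dim)    -- current decoding context, the central node
        # regions:  (batch, N, dim) -- detected object-region features, the edge nodes
        q = self.q(semantic).unsqueeze(1)                # (batch, 1, dim)
        k, v = self.k(regions), self.v(regions)          # (batch, N, dim)
        scores = (q * k).sum(-1) / k.size(-1) ** 0.5     # center-to-edge messages, (batch, N)
        weights = F.softmax(scores, dim=-1)              # dynamic weights over regions
        visual_msg = (weights.unsqueeze(-1) * v).sum(1)  # aggregated edge-to-center message
        # Self-attention gate: conjunctions/attributives can attend to the semantic node itself.
        g = torch.sigmoid(self.self_gate(semantic))      # (batch, 1)
        return g * semantic + (1 - g) * visual_msg

# Usage: one attention step for 2 images, each with 36 region features of size 512.
attn = StarGraphAttention(512)
out = attn(torch.randn(2, 512), torch.randn(2, 36, 512))
print(out.shape)  # torch.Size([2, 512])
```

The gate is one simple way to realize the abstract's idea that words not grounded in the image should focus on semantic rather than visual information; the paper's actual mechanism may weight the central and edge nodes jointly through self-attention instead.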
