Abstract

Attention mechanisms have driven substantial progress in image captioning by selectively embedding semantic words or local regions into the language model. However, current attention-based captioning methods ignore fine-grained semantic information and its interaction with visual regions. Inspired by how humans describe an image, through divergent observation followed by convergent attention, we propose a novel divergent-convergent attention (DCA) model to address these limitations. In our DCA model, divergent observation is reflected mainly in the multi-perspective inputs: a visual collection obtained from object detection, and three semantic components of a scene graph, namely objects, attributes, and relations. Convergent attention then merges these multi-perspective inputs through a hierarchical structure, adaptively deciding which perspective is crucial and which element within the focused perspective dominates the attention process. Our model also exploits the interaction between visual objects and semantic components so that each complements the other. Owing to this interaction and to the gradual convergence of attention, the model can attend to the corresponding local region more precisely under the guidance of the semantic components. Conversely, with the assistance of the visual components, the DCA model can effectively use the fine-grained semantic components to generate more descriptive sentences. Experiments on the MS COCO dataset demonstrate the superiority of our proposed method.
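
The hierarchical merging of perspectives described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the scoring function, the feature dimensions, or the order of the two levels, so the dot-product scoring, the within-perspective-then-across-perspective ordering, and all names (`attend`, `hierarchical_attention`, the perspective keys) are assumptions chosen for illustration.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attend(query, vectors):
    # Dot-product attention (an assumed scoring function):
    # returns the attention weights and the weighted sum of the vectors.
    weights = softmax([dot(query, v) for v in vectors])
    dim = len(vectors[0])
    context = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
    return weights, context

def hierarchical_attention(query, perspectives):
    # perspectives: dict mapping a perspective name to a list of feature
    # vectors, all with the same dimension as the query.
    names = list(perspectives)
    element_weights = {}
    contexts = []
    # Level 1: decide which element dominates within each perspective.
    for name in names:
        weights, context = attend(query, perspectives[name])
        element_weights[name] = weights
        contexts.append(context)
    # Level 2: decide which perspective is most relevant to the query.
    perspective_weights, fused = attend(query, contexts)
    return dict(zip(names, perspective_weights)), element_weights, fused

# Toy inputs (hypothetical 2-d features, not real detector or scene-graph output).
perspectives = {
    "visual": [[1.0, 0.0], [0.0, 1.0]],
    "objects": [[0.5, 0.5]],
    "attributes": [[0.2, 0.8], [0.9, 0.1]],
    "relations": [[0.0, 1.0]],
}
perspective_weights, element_weights, fused = hierarchical_attention([1.0, 0.0], perspectives)
```

In this sketch the query would stand in for the language model's hidden state at the current decoding step, and `fused` for the context vector passed back to it; both roles are assumptions about how such a module would typically be wired into a captioning decoder.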
