Abstract
Image captioning is a multi-modal task that describes an image in natural language. Many state-of-the-art methods adopt an encoder–decoder architecture, encoding an image either with convolutional neural networks or with a structured semantic scene graph that contains object, relationship, and attribute information. However, the image scene graphs constructed by existing scene graph generation models are generally noisy. To alleviate this problem, we propose a multi-level cross-modal alignment (MCA) module that aligns the image scene graph with the sentence scene graph at different levels. MCA distills the redundant information of the image scene graph under the guidance of the sentence scene graph and provides commonsense knowledge for the decoder. In addition to the semantic relationships, we exploit the bounding boxes of the detected objects to compute implicit spatial relationships among them. Our decoder fuses the aligned scene graph features and the implicit spatial relationship information via a dynamic mixture attention and translates them into descriptions. Extensive experiments on the MSCOCO dataset show promising results compared with state-of-the-art methods, verifying the effectiveness of our approach.
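The abstract does not specify how the implicit spatial relationships are computed from the bounding boxes. Below is a minimal sketch of one common way to encode pairwise box geometry (log-scaled relative center offsets and size ratios), as used in geometry-aware captioning models; the function name and the 4-dimensional feature layout are illustrative assumptions, not the authors' implementation.

```python
import torch

def pairwise_spatial_features(boxes: torch.Tensor) -> torch.Tensor:
    """boxes: (N, 4) tensor of [x1, y1, x2, y2] detections.
    Returns an (N, N, 4) tensor of pairwise geometry features:
    log-scaled relative center offsets and log width/height ratios."""
    cx = (boxes[:, 0] + boxes[:, 2]) / 2              # box centers, x
    cy = (boxes[:, 1] + boxes[:, 3]) / 2              # box centers, y
    w = (boxes[:, 2] - boxes[:, 0]).clamp(min=1e-3)   # box widths
    h = (boxes[:, 3] - boxes[:, 1]).clamp(min=1e-3)   # box heights

    # Broadcast pairwise differences/ratios to (N, N).
    dx = torch.log((cx[:, None] - cx[None, :]).abs().clamp(min=1e-3) / w[:, None])
    dy = torch.log((cy[:, None] - cy[None, :]).abs().clamp(min=1e-3) / h[:, None])
    dw = torch.log(w[:, None] / w[None, :])
    dh = torch.log(h[:, None] / h[None, :])
    return torch.stack([dx, dy, dw, dh], dim=-1)      # (N, N, 4)

# Example with three detected objects.
boxes = torch.tensor([[10., 10., 50., 80.],
                      [40., 20., 90., 70.],
                      [ 0.,  0., 30., 30.]])
rel = pairwise_spatial_features(boxes)
print(rel.shape)  # torch.Size([3, 3, 4])
```

Features of this kind are typically projected to attention biases or relation embeddings; how the paper's dynamic mixture attention consumes them is not stated in the abstract.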