Abstract

• We propose to leverage rich visual concepts and structured knowledge for dense caption generation.
• We use the objective function of scene graph generation to propagate structured knowledge through the refining pipeline.
• Experimental results and qualitative experiments confirm the effectiveness of our model.

The dense captioning task aims to both localize and describe salient regions in images with natural language. It can benefit from rich visual concepts, including objects, pair-wise relationships, and so on. However, due to the challenging combinatorial complexity of formulating <subject-predicate-object> triplets, very little work has been done to integrate them into the dense captioning task. Inspired by the recent success of scene graph generation in object and relationship detection, we propose a scene-graph-guided message passing network for dense caption generation. We first exploit message passing between objects and their relationships with a feature refining structure. Moreover, we formulate the message passing as an inter-connected visual concept generation problem, in which the objective function of scene graph generation guides the region feature learning. This scene graph guidance propagates the structured knowledge of the graph through a concept-region message passing mechanism (CR-MPM), which improves the regional feature representation. Finally, the refined regional features are fed to an LSTM-based decoder to generate dense captions. Our model achieves competitive performance on Visual Genome compared with existing methods, and qualitative experiments further confirm its effectiveness in the dense captioning task.
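To make the described pipeline concrete, the following is a minimal PyTorch sketch of one round of message passing between object and relationship features, followed by LSTM decoding of a refined region feature. It is only a sketch in the spirit of the abstract's CR-MPM idea: the module names, GRU-cell updates, mean-pooling aggregation, and dimensions are all illustrative assumptions, not the authors' actual implementation.

```python
# Hedged sketch of scene-graph-guided feature refinement plus LSTM decoding.
# All names, sizes, and the aggregation scheme are assumptions for illustration.
import torch
import torch.nn as nn


class ConceptRegionMessagePassing(nn.Module):
    """One round of message passing between object (node) features and
    relationship (edge) features; a hypothetical stand-in for CR-MPM."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.obj_update = nn.GRUCell(dim, dim)       # nodes take messages from incident edges
        self.rel_update = nn.GRUCell(2 * dim, dim)   # edges take messages from their endpoints

    def forward(self, obj_feats, rel_feats, edges):
        # obj_feats: (N, dim) region/object features
        # rel_feats: (E, dim) relationship features
        # edges: (E, 2) long tensor of (subject_idx, object_idx) pairs
        subj, obj = edges[:, 0], edges[:, 1]

        # Edge update: each relationship sees its subject and object features.
        rel_msg = torch.cat([obj_feats[subj], obj_feats[obj]], dim=-1)
        rel_feats = self.rel_update(rel_msg, rel_feats)

        # Node update: average the features of all edges touching a node.
        agg = torch.zeros_like(obj_feats)
        cnt = torch.zeros(obj_feats.size(0), 1)
        for idx in (subj, obj):
            agg.index_add_(0, idx, rel_feats)
            cnt.index_add_(0, idx, torch.ones(idx.size(0), 1))
        obj_feats = self.obj_update(agg / cnt.clamp(min=1), obj_feats)
        return obj_feats, rel_feats


# Toy usage: 4 regions, 3 <subject-predicate-object> edges; an LSTM decoder
# then consumes one refined region feature as its initial hidden state.
dim = 512
mpm = ConceptRegionMessagePassing(dim)
obj = torch.randn(4, dim)
rel = torch.randn(3, dim)
edges = torch.tensor([[0, 1], [1, 2], [3, 0]])
obj, rel = mpm(obj, rel, edges)

decoder = nn.LSTM(input_size=dim, hidden_size=dim, batch_first=True)
h0 = obj[0].view(1, 1, dim)             # refined feature of region 0
c0 = torch.zeros_like(h0)
word_embs = torch.randn(1, 5, dim)      # 5 placeholder word embeddings
outputs, _ = decoder(word_embs, (h0, c0))  # (1, 5, dim) states for word prediction
```

The sketch keeps node and edge features in separate tensors indexed by the graph's edge list, so one refinement round is two dense GRU updates plus scatter-style aggregation; the actual paper may use a different update rule or more rounds.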
