Abstract

While deep neural networks have recently achieved promising results on the image captioning task, they do not explicitly exploit the structural visual and textual knowledge within an image. In this work, we propose the Scene Graph Captioner (SGC) framework for image captioning, which captures the comprehensive structural semantics of a visual scene by explicitly modeling objects, their attributes, and the relationships between them. First, we develop an approach that generates the scene graph by learning individual modules on large object, attribute, and relationship datasets. SGC then incorporates high-level graph information and visual attention information into a deep captioning framework. Specifically, we propose a novel framework that embeds the scene graph into a structural representation capturing both the semantic concepts and the graph topology. Further, we develop a scene-graph-driven method that generates an attention graph by exploiting the high internal homogeneity and external inhomogeneity among the nodes of the scene graph. Finally, an LSTM-based framework translates this information into text. We evaluate the proposed framework on a held-out MSCOCO dataset.
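To make the described pipeline concrete, the following is a minimal sketch of one plausible reading of the decoding stage: scene-graph nodes (object and attribute concepts) and edges (relationships) are embedded, aggregated into a graph-level vector, and used to initialize an LSTM that emits the caption word by word. Every detail here (the shared concept table, the neighbor-sum aggregation standing in for the structural graph embedding, the dimensions, and all class and variable names) is an illustrative assumption, not the paper's exact SGC architecture; the attention-graph component is omitted for brevity.

```python
# A minimal, illustrative sketch of a scene-graph-conditioned LSTM captioner.
# The architecture details (embedding sizes, neighbor aggregation, mean
# pooling) are assumptions for exposition, not the exact SGC design.
import torch
import torch.nn as nn


class SceneGraphCaptionerSketch(nn.Module):
    def __init__(self, num_concepts, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        # One shared table for object / attribute / relationship concepts.
        self.concept_embed = nn.Embedding(num_concepts, embed_dim)
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        # LSTM decoder initialized from the pooled graph representation.
        self.graph_to_hidden = nn.Linear(embed_dim, hidden_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def embed_graph(self, node_ids, edge_index):
        # node_ids: (num_nodes,) concept ids; edge_index: (2, num_edges).
        nodes = self.concept_embed(node_ids)
        # Crude topology signal: add to each node the sum of its incoming
        # neighbors, a stand-in for the paper's structural graph embedding.
        src, dst = edge_index
        agg = torch.zeros_like(nodes).index_add_(0, dst, nodes[src])
        return (nodes + agg).mean(dim=0)  # pool all nodes to one vector

    def forward(self, node_ids, edge_index, caption_ids):
        graph_vec = self.embed_graph(node_ids, edge_index)
        h0 = torch.tanh(self.graph_to_hidden(graph_vec)).view(1, 1, -1)
        c0 = torch.zeros_like(h0)
        words = self.word_embed(caption_ids).unsqueeze(0)  # (1, T, E)
        out, _ = self.lstm(words, (h0, c0))
        return self.out(out)  # (1, T, vocab) next-word logits


# Toy usage: a 3-node graph ("man" -riding-> "horse", "brown" -of-> "horse").
model = SceneGraphCaptionerSketch(num_concepts=100, vocab_size=1000)
node_ids = torch.tensor([5, 17, 42])         # e.g. man, horse, brown
edge_index = torch.tensor([[0, 2], [1, 1]])  # man->horse, brown->horse
caption = torch.tensor([1, 7, 9, 3])         # <bos> a man riding ...
logits = model(node_ids, edge_index, caption)
print(logits.shape)  # torch.Size([1, 4, 1000])
```

Mean pooling collapses the graph to a single vector purely for compactness; the attention graph described in the abstract would presumably weight nodes by their internal homogeneity and external inhomogeneity rather than treating them uniformly.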
