Abstract

As a cross-media intelligence task, visual dialog requires answering a sequence of questions about an image, using the dialog history as context. Answering correctly depends on exploring the semantic dependencies among the relevant visual and textual content. Prior work has usually ignored the knowledge hidden in internal and external textual-visual relationships, which leads to unreasonable inferences. In this paper, we propose Aligning Vision-Language for Graph Inference (AVLGI), a visual dialog model that combines internal context-aware information with external scene graph knowledge. Compared with other approaches, it compensates for the lack of structural inference in visual dialog. The system consists of three modules: Inter-Modalities Alignment (IMA), Visual Graph Attended by Text (VGAT), and Combining Scene Graph and Textual Contents (CSGTC). Specifically, the IMA module represents an image as a set of integrated visual regions and their corresponding textual concepts, each reflecting certain semantics. The VGAT module treats the semantically enriched visual features as observed nodes and measures the importance weight between every pair of nodes in the visual graph. The CSGTC module supplements the relationships between visual objects with additional information from the scene graph. We evaluate the model both qualitatively and quantitatively on the VisDial v1.0 dataset and show that AVLGI outperforms previous state-of-the-art models.
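
The abstract does not specify how VGAT computes the importance weight between two nodes. As a rough illustration only, the following sketch assumes one common choice: each region feature is gated element-wise by a text embedding, pairwise scores are scaled dot products, and a row-wise softmax turns them into weights. The function name `text_attended_graph`, the gating step, and the toy vectors are all hypothetical and are not taken from the paper.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def text_attended_graph(nodes, text):
    """Hypothetical text-conditioned attention over visual graph nodes.

    nodes: list of region feature vectors (the graph nodes).
    text:  one embedding of the question and dialog history.
    Returns the pairwise weight matrix and the updated node features.
    """
    # Assumption: condition each node on the text by element-wise gating.
    gated = [[v * t for v, t in zip(node, text)] for node in nodes]
    scale = math.sqrt(len(text))
    # Score every pair of nodes, then normalise each row into weights.
    weights = [softmax([dot(gi, gj) / scale for gj in gated]) for gi in gated]
    # Each node becomes a weighted sum of all original node features.
    dim = len(nodes[0])
    updated = [
        [sum(w * node[k] for w, node in zip(row, nodes)) for k in range(dim)]
        for row in weights
    ]
    return weights, updated


# Toy example: three regions with 3-dimensional features.
nodes = [[1.0, 0.0, 0.5], [0.9, 0.1, 0.4], [0.0, 1.0, 0.2]]
text = [1.0, 0.5, 0.0]
weights, updated = text_attended_graph(nodes, text)
```

In this toy setup the text embedding emphasises the first feature dimension, so the first region attends more strongly to the second region, which shares that dimension, than to the third. A learned model would replace the fixed gating with trained projections.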
