Abstract

Visual dialog is a challenging task that requires comprehending the semantic dependencies among implicit visual and textual contexts. The task can be cast as relational inference in a graphical model with sparse contextual subjects (nodes) and an unknown graph structure (relation descriptor); how to model the underlying context-aware relational inference is critical. To this end, we propose a novel context-aware graph (CAG) neural network. We focus on fine-grained relational reasoning with object-level dialog-historical co-reference nodes. The graph structure (the relations in the dialog) is iteratively updated using an adaptive top-K message passing mechanism. To eliminate sparse, useless relations, each node has dynamic relations in the graph (its own set of K related neighbor nodes), and only the most relevant nodes contribute to the context-aware relational graph inference. In addition, to avoid the performance degradation caused by the linguistic bias of the dialog history, we propose a purely visual-aware knowledge distillation mechanism named CAG-Distill, in which image-only visual clues are used to regularize the joint dialog-historical contextual awareness at the object level. Experimental results on the VisDial v0.9 and v1.0 datasets show that both CAG and CAG-Distill outperform comparative methods. Visualization results further validate the remarkable interpretability of our graph inference solution.
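
To make the idea of top-K message passing concrete, the following is a minimal sketch of one generic variant, not the CAG implementation. The function name `top_k_message_passing`, the dot-product relevance score, the softmax weighting, and the 0.5/0.5 mixing of a node with its aggregated message are all assumptions chosen for illustration; the abstract does not specify any of them.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(xs):
    """Numerically stable softmax over a non-empty list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_message_passing(nodes, k, steps=1):
    """Illustrative top-K message passing (hypothetical, not the paper's code).

    nodes: list of equal-length feature vectors, one per graph node.
    k:     number of neighbors each node keeps per step.
    Returns the updated node features and the neighbor lists of the last step.
    """
    n = len(nodes)
    if n < 2 or k < 1:
        raise ValueError("need at least two nodes and k >= 1")
    k = min(k, n - 1)
    neighbors = []
    for _ in range(steps):
        new_nodes, neighbors = [], []
        for i, h in enumerate(nodes):
            # Assumed relevance score: dot product with every other node.
            scores = [(dot(h, nodes[j]), j) for j in range(n) if j != i]
            # Keep only the K most relevant neighbors (ties broken by index).
            top = sorted(scores, key=lambda s: (-s[0], s[1]))[:k]
            weights = softmax([s for s, _ in top])
            # Aggregate a message from the selected neighbors only.
            msg = [
                sum(w * nodes[j][d] for w, (_, j) in zip(weights, top))
                for d in range(len(h))
            ]
            # Assumed update rule: average the node with its message.
            new_nodes.append([0.5 * a + 0.5 * b for a, b in zip(h, msg)])
            neighbors.append([j for _, j in top])
        # Neighbors are re-selected from the updated features at each step,
        # so the graph structure changes as the features change.
        nodes = new_nodes
    return nodes, neighbors
```

Because each node picks its own K neighbors from the current features, the resulting relations are directed and can differ from step to step, which is the general sense in which such a graph structure is "dynamic".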
