Visual Dialog aims to generate an appropriate response based on a multi-round dialog history and a given image. Existing methods either focus on semantic interaction or implicitly capture coarse-grained structural interaction (e.g., pronoun co-references). Fine-grained, explicit structural interaction features of the dialog history are seldom explored, resulting in insufficient feature learning and difficulty in capturing precise context. To address these issues, we propose a structure-aware dual-level graph interactive network (SDGIN) that integrates verb-specific semantic roles and co-reference resolution to explicitly capture contextual structural features for both discriminative and generative tasks in visual dialog. Specifically, we create a novel structural interaction graph that injects syntactic knowledge priors into the dialog by introducing semantic role labeling, which indicates which words form the sentence stem. Furthermore, considering the single-perspective limitation of previous algorithms, we design a dual-perspective mechanism that learns fine-grained token-level context structure features and coarse-grained utterance-level interactions in parallel. This offers an elegant view for exploring precise context interactions, allowing features of different granularities to complement and enhance one another. Experimental results demonstrate the superiority of our approach: SDGIN outperforms previous task-specific models and achieves a significant improvement on the VisDial v1.0 benchmark.
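To make the dual-perspective idea concrete, the following is a minimal, hypothetical PyTorch sketch of one dual-level graph interaction step. All names (`GraphAttentionLayer`, `DualLevelInteraction`, `token2utter`) and design choices (single-head graph attention, concatenation-based fusion) are illustrative assumptions rather than the paper's actual SDGIN implementation; the abstract does not specify these details. The sketch only shows the general pattern of running token-level and utterance-level graph attention in parallel and fusing the two granularities.

```python
# Illustrative sketch only: not the paper's SDGIN architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphAttentionLayer(nn.Module):
    """Single-head graph attention restricted to structural edges in `adj`."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (N, dim) node features; adj: (N, N) 0/1 edge mask (e.g., SRL or co-reference edges)
        n = x.size(0)
        h = self.proj(x)
        # Pairwise concatenation of node features for attention scoring: (N, N, 2*dim)
        pairs = torch.cat(
            [h.unsqueeze(1).expand(n, n, -1), h.unsqueeze(0).expand(n, n, -1)],
            dim=-1,
        )
        scores = self.score(pairs).squeeze(-1)                # (N, N)
        scores = scores.masked_fill(adj == 0, float("-inf"))  # attend only along graph edges
        return F.elu(F.softmax(scores, dim=-1) @ h)


class DualLevelInteraction(nn.Module):
    """Run token-level and utterance-level graph attention in parallel, then fuse."""

    def __init__(self, dim: int):
        super().__init__()
        self.token_gat = GraphAttentionLayer(dim)   # fine-grained perspective
        self.utter_gat = GraphAttentionLayer(dim)   # coarse-grained perspective
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, tokens, token_adj, utters, utter_adj, token2utter):
        # tokens: (T, dim); utters: (U, dim); token2utter: (T,) utterance index of each token
        fine = self.token_gat(tokens, token_adj)
        coarse = self.utter_gat(utters, utter_adj)
        # Broadcast each token's utterance-level context back down and fuse granularities
        return self.fuse(torch.cat([fine, coarse[token2utter]], dim=-1))


# Toy usage: 6 tokens across 2 utterances with a hypothetical structural edge.
dim = 16
tokens, utters = torch.randn(6, dim), torch.randn(2, dim)
token_adj = torch.eye(6)                  # self-loops keep the softmax well-defined
token_adj[0, 1] = token_adj[1, 0] = 1.0   # e.g., an SRL edge between tokens 0 and 1
utter_adj = torch.ones(2, 2)              # fully connected utterance graph
token2utter = torch.tensor([0, 0, 0, 1, 1, 1])
out = DualLevelInteraction(dim)(tokens, token_adj, utters, utter_adj, token2utter)
print(out.shape)  # torch.Size([6, 16])
```

Under these assumptions, the two perspectives stay decoupled until the final fusion, which is one simple way the fine-grained and coarse-grained features could "complement and enhance" each other as the abstract describes.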