Abstract

Visual dialog is a fundamental vision-language task in which an AI agent holds a meaningful dialogue with humans in natural language about visual content. The task remains challenging, since there is still no consensus on how to capture the rich contextual information contained in the environment, rather than focusing only on visual objects. Furthermore, conventional methods suffer from a single-answer learning strategy, which accepts only one correct answer and ignores the diversity of linguistic expression (i.e., one meaning can be phrased in multiple ways through rephrasing, synonyms, etc.). In this paper, we introduce Contextual-Aware Representation and linguistic-diverse Expression (CARE), a novel plug-and-play framework with contextual-based graph embedding and curriculum contrastive learning that addresses these two issues. Specifically, the contextual-based graph embedding (CGE) module integrates environmental context information with visual objects to improve answer quality. In addition, we propose a curriculum contrastive learning (CCL) paradigm that imitates how humans learn when a question has multiple correct answers sharing the same meaning but expressed differently. To support CCL, a CCL loss is designed to progressively strengthen the model's ability to identify answers with correct semantics. Extensive experiments on two benchmark datasets show that our method outperforms the state of the art by a considerable margin on VisDial V1.0 (4.63% NDCG) and VisDial V0.9 (1.27% MRR, 1.74% R@1, 0.87% R@5, 1.28% R@10, 0.26 Mean).
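To make the CCL idea more concrete, below is a minimal, hypothetical PyTorch sketch of a curriculum contrastive loss in the spirit described above: it treats every semantically correct candidate answer as a positive and gradually shifts weight from the single annotated answer to the full set of paraphrased positives as training progresses. The function name, the linear schedule, and the assumption that index 0 holds the annotated answer are illustrative choices, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def curriculum_contrastive_loss(query, answers, pos_mask, step, total_steps, tau=0.1):
    """Hypothetical sketch of a curriculum contrastive (CCL-style) loss.

    query:       (B, D)    fused dialog/context embeddings
    answers:     (B, N, D) candidate-answer embeddings per example
    pos_mask:    (B, N)    float mask, 1.0 for semantically correct answers
                           (multiple positives allowed)
    step, total_steps:     training progress driving the curriculum weight
    """
    q = F.normalize(query, dim=-1).unsqueeze(1)   # (B, 1, D)
    a = F.normalize(answers, dim=-1)              # (B, N, D)
    logits = (q * a).sum(-1) / tau                # (B, N) scaled cosine similarities

    # Multi-positive InfoNCE: every semantically correct answer is a positive.
    log_prob = logits - torch.logsumexp(logits, dim=-1, keepdim=True)
    pos_log_prob = (log_prob * pos_mask).sum(-1) / pos_mask.sum(-1).clamp(min=1.0)

    # Curriculum: start by emphasizing the annotated answer alone, then ramp
    # toward all paraphrased positives. A linear schedule is assumed here.
    weight = min(1.0, step / max(1, total_steps))
    gt_log_prob = log_prob[:, 0]  # assumption: index 0 is the annotated answer
    return -((1.0 - weight) * gt_log_prob + weight * pos_log_prob).mean()
```

In this reading, the progressive schedule plays the role the abstract ascribes to the CCL loss: early training resembles standard single-answer supervision, while later training rewards the model for ranking all semantically equivalent answers highly.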
