Abstract

Given a question about an image, Visual Commonsense Reasoning (VCR) requires providing not only a correct answer but also a rationale that justifies it. VCR is a challenging task because it demands proper semantic alignment and reasoning between the image and the linguistic expression. Recent approaches show great promise by exploring holistic attention mechanisms or graph-based networks, but most of them perform implicit reasoning and ignore the semantic dependencies within the linguistic expression. In this paper, we propose a novel explicit cross-modal representation learning network for VCR that incorporates syntactic information into visual reasoning and natural language understanding. The proposed method enjoys several merits. First, based on a two-branch neural module network, we perform explicit cross-modal reasoning guided by the high-level syntactic structure of the linguistic expression. Second, the semantic structure of the linguistic expression is incorporated into a syntactic GCN to facilitate language understanding. Third, our explicit cross-modal representation learning network provides a traceable reasoning flow, offering visible, fine-grained evidence for the answer and rationale. Quantitative and qualitative evaluations on the public VCR dataset demonstrate that our approach performs favorably against state-of-the-art methods. The full code for our work is available in the supplementary material.
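To make the syntactic GCN component more concrete, the sketch below shows one plausible graph-convolution layer that propagates token embeddings along dependency-parse edges. This is a minimal, assumption-based illustration only; the class name, tensor shapes, and degree normalization are hypothetical choices, and the authors' actual implementation is provided in the supplementary material.

```python
# Minimal sketch of a syntactic GCN layer (hypothetical; not the paper's code).
# Token embeddings are propagated along dependency-parse edges so that each
# word representation absorbs information from its syntactic neighbours.
import torch
import torch.nn as nn


class SyntacticGCNLayer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.neighbour = nn.Linear(dim, dim)  # transform of neighbour messages
        self.self_loop = nn.Linear(dim, dim)  # keep each token's own features

    def forward(self, tokens: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, dim) word embeddings
        # adj:    (batch, seq_len, seq_len) dependency adjacency matrix
        #         (1 where a dependency edge connects two tokens, else 0)
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1.0)   # degree normalization
        messages = torch.bmm(adj, self.neighbour(tokens)) / deg
        return torch.relu(messages + self.self_loop(tokens))


# Hypothetical usage: a batch of 2 sentences, 4 tokens each, 256-d embeddings.
layer = SyntacticGCNLayer(dim=256)
tokens = torch.randn(2, 4, 256)
adj = torch.randint(0, 2, (2, 4, 4)).float()
out = layer(tokens, adj)  # (2, 4, 256) syntax-aware token representations
```

Stacking such layers would let each token representation incorporate increasingly distant syntactic context before being fused with the visual branch.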
