Abstract

Visual dialog is a challenging cross-media task that requires answering a sequence of questions about a given image while drawing on the dialog history. The key problem is therefore how to answer visually grounded questions when the dialog contains ambiguous references. In this work, we propose a novel method called Multi-View Semantic Understanding for Visual Dialog (MVSU) to address the visual coreference resolution problem. The model consists of two main textual processing modules: SRR (Semantic Retention RNN) and CRoT (Coreference Resolution on Text). Specifically, the SRR module generates semantically meaningful word features by taking contextual information into account. The CRoT module works from a textual perspective, dividing the relevant nouns and pronouns into clusters that supply the detailed information needed for semantic understanding. Experiments on the VisDial v1.0 dataset demonstrate that MVSU improves the model's ability to understand semantic information.
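
The abstract does not give implementation details, so the following is only a rough sketch of the two kinds of processing it names: a bidirectional LSTM encoder standing in for SRR's contextual word features, and a naive recency heuristic standing in for CRoT's noun/pronoun clusters. All class names, dimensions, and the clustering rule are illustrative assumptions, not the authors' method.

```python
# Illustrative sketch only -- not the MVSU implementation. The module names,
# dimensions, and the recency-based clustering rule are assumptions.
import torch
import torch.nn as nn

class SemanticRetentionRNN(nn.Module):
    """SRR-like encoder: a bidirectional LSTM that turns word embeddings
    into contextualized word features (one vector per word)."""
    def __init__(self, vocab_size=1000, embed_dim=64, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim,
                           batch_first=True, bidirectional=True)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) -> (batch, seq_len, 2 * hidden_dim)
        outputs, _ = self.rnn(self.embed(token_ids))
        return outputs

def cluster_mentions(mentions):
    """CRoT-like grouping: start a new cluster at each noun and attach each
    pronoun to the most recent noun's cluster. This recency heuristic is a
    stand-in for a real textual coreference resolver."""
    clusters, last_noun_cluster = [], None
    for text, pos in mentions:  # pos is 'NOUN' or 'PRON'
        if pos == 'NOUN':
            clusters.append([text])
            last_noun_cluster = clusters[-1]
        elif last_noun_cluster is not None:
            last_noun_cluster.append(text)
    return clusters

if __name__ == "__main__":
    srr = SemanticRetentionRNN()
    feats = srr(torch.randint(0, 1000, (2, 7)))  # two toy 7-token sentences
    print(feats.shape)  # torch.Size([2, 7, 128])
    # Mentions from "A man holds a dog. It barks at him."
    mentions = [("man", "NOUN"), ("dog", "NOUN"),
                ("it", "PRON"), ("him", "PRON")]
    print(cluster_mentions(mentions))  # [['man'], ['dog', 'it', 'him']]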
