Abstract

Visual dialog is a challenging cross-media task that requires answering a sequence of questions about a given image while drawing on the dialog history. The key problem is therefore how to answer visually grounded questions when the dialog contains ambiguous references. In this work, we propose a novel method called Multi-View Semantic Understanding for Visual Dialog (MVSU) to address the visual coreference resolution problem. The model consists of two main textual processing modules: SRR (Semantic Retention RNN) and CRoT (Coreference Resolution on Text). Specifically, the SRR module generates semantically meaningful word features by taking contextual information into account. The CRoT module works from a textual perspective, dividing the relevant nouns and pronouns into clusters that supply the detailed information needed for semantic understanding. Experiments on the VisDial v1.0 dataset demonstrate that MVSU improves the model's ability to understand semantic information.
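
The abstract does not give implementation details, so the following is only a rough sketch of the two kinds of processing it names: a bidirectional LSTM encoder standing in for SRR's contextual word features, and a naive recency heuristic standing in for CRoT's noun/pronoun clusters. All class names, dimensions, and the clustering rule are illustrative assumptions, not the authors' method.

```python
# Illustrative sketch only -- not the MVSU implementation. The module names,
# dimensions, and the recency-based clustering rule are assumptions.
import torch
import torch.nn as nn

class SemanticRetentionRNN(nn.Module):
    """SRR-like encoder: a bidirectional LSTM that turns word embeddings
    into contextualized word features (one vector per word)."""
    def __init__(self, vocab_size=1000, embed_dim=64, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim,
                           batch_first=True, bidirectional=True)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) -> (batch, seq_len, 2 * hidden_dim)
        outputs, _ = self.rnn(self.embed(token_ids))
        return outputs

def cluster_mentions(mentions):
    """CRoT-like grouping: start a new cluster at each noun and attach each
    pronoun to the most recent noun's cluster. This recency heuristic is a
    stand-in for a real textual coreference resolver."""
    clusters, last_noun_cluster = [], None
    for text, pos in mentions:  # pos is 'NOUN' or 'PRON'
        if pos == 'NOUN':
            clusters.append([text])
            last_noun_cluster = clusters[-1]
        elif last_noun_cluster is not None:
            last_noun_cluster.append(text)
    return clusters

if __name__ == "__main__":
    srr = SemanticRetentionRNN()
    feats = srr(torch.randint(0, 1000, (2, 7)))  # two toy 7-token sentences
    print(feats.shape)  # torch.Size([2, 7, 128])
    # Mentions from "A man holds a dog. It barks at him."
    mentions = [("man", "NOUN"), ("dog", "NOUN"),
                ("it", "PRON"), ("him", "PRON")]
    print(cluster_mentions(mentions))  # [['man'], ['dog', 'it', 'him']]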
