Abstract

The Visual Dialog task requires retrieving the correct answer given detected objects, the current question, and the dialog history. In real-world scenarios, however, most existing models suffer from the hard-positive problem and cannot reason about unseen features, which limits their generalization ability. To address this issue, we propose two Relation Regularized Modules (RRM) in this article. The first is the Visual Relation Regularized Module (VRRM), which seeks known visual features that are semantically related to unknown visual features and leverages these known features to help understand the unknown ones. The second is the Text Relation Regularized Module (TRRM), which emphasizes the keywords in answers to strengthen the understanding of unknown text features. To evaluate the effectiveness of these modules, we construct two zero-shot Visual Dialog splits for verification: Visual Zero-shot VisDial, with unseen visual features, and Text Zero-shot VisDial, with unseen answers. Experimental results demonstrate that our proposed modules achieve state-of-the-art performance on zero-shot Visual Dialog with unseen visual features and unseen answers, while also producing comparable results on the benchmark VisDial v1.0 test set.
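The abstract does not give implementation details, but the VRRM description (relating unknown visual features to semantically similar known ones and using the latter to assist understanding) suggests an attention-style mechanism. The following is a minimal PyTorch sketch under that assumption only; the class name, projections, residual mixing, and feature dimension are all illustrative, not the paper's actual design.

```python
# Hypothetical sketch of the relation-regularized idea: unknown features
# attend over known features and are enriched by the related ones.
# All names and shapes here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RelationRegularizedAttention(nn.Module):
    """Mixes known features into unknown ones via semantic-relation weights."""

    def __init__(self, dim: int):
        super().__init__()
        self.query_proj = nn.Linear(dim, dim)
        self.key_proj = nn.Linear(dim, dim)

    def forward(self, unknown: torch.Tensor, known: torch.Tensor) -> torch.Tensor:
        # unknown: (batch, n_unknown, dim); known: (batch, n_known, dim)
        q = self.query_proj(unknown)
        k = self.key_proj(known)
        # Semantic-relation scores between each unknown and each known feature.
        relation = torch.matmul(q, k.transpose(-2, -1)) / (q.size(-1) ** 0.5)
        weights = F.softmax(relation, dim=-1)
        # Represent each unknown feature with its related known features,
        # then add it back as an assisting (regularizing) signal.
        assisted = torch.matmul(weights, known)
        return unknown + assisted


# Usage: enrich unseen region features with related, previously seen ones.
vrrm = RelationRegularizedAttention(dim=512)
unseen = torch.randn(2, 5, 512)   # features absent from training
seen = torch.randn(2, 36, 512)    # detected-object features the model knows
out = vrrm(unseen, seen)          # (2, 5, 512)
```

The TRRM described above could be sketched analogously, with keyword embeddings from the candidate answers playing the role of the known features.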
