SeqDialN: Sequential Visual Dialog Network in Joint Visual-Linguistic Representation Space

Vicent Ying, Fanqi Meng, Ming-Kuang Daniel Wu, Liu Yang, Xiao Liu, James Xu

doi:10.18653/v1/2021.dialdoc-1.2

Abstract

The key challenge of the visual dialog task is how to fuse features from multimodal sources and extract relevant information from the dialog history to answer the current query. In this work, we formulate a visual dialog as an information flow in which each piece of information is encoded with the joint visual-linguistic representation of a single dialog round. Based on this formulation, we treat the visual dialog task as a sequence problem over ordered visual-linguistic vectors. For featurization, we use a Dense Symmetric Co-Attention network (Nguyen and Okatani, 2018) as a lightweight vision-language joint representation generator to fuse multimodal features (i.e., image and text), yielding better computational and data efficiency. For inference, we propose two Sequential Dialog Networks (SeqDialN): the first uses an LSTM (Hochreiter and Schmidhuber, 1997) for information propagation (IP) and the second uses a modified Transformer (Vaswani et al., 2017) for multi-step reasoning (MR). Our architecture separates the complexity of multimodal feature fusion from that of inference, which allows a simpler design of the inference engine. On the VisDial v1.0 test-std dataset, our best single generative SeqDialN achieves 62.54% NDCG and 48.63% MRR; our ensemble generative SeqDialN achieves 63.78% NDCG and 49.98% MRR, setting a new state of the art among generative visual dialog models. We fine-tune discriminative SeqDialN with dense annotations and boost its performance to 72.41% NDCG and 55.11% MRR. We discuss the extensive experiments conducted to demonstrate the effectiveness of our model components. We also visualize the reasoning process over the relevant conversation rounds and discuss our fine-tuning methods. The code is available at https://github.com/xiaoxiaoheimei/SeqDialN.
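The sequence formulation described in the abstract can be illustrated with a toy sketch: each dialog round is assumed to be already fused into one joint vector, and a recurrent state is carried forward through the rounds in order. This is only an illustration under stated assumptions. A plain element-wise tanh recurrence stands in for the LSTM named in the abstract, and the function name `propagate`, the scalar weights `w_in` and `w_rec`, and the toy vectors are all hypothetical.

```python
import math


def propagate(round_vectors, w_in=0.5, w_rec=0.5):
    """Fold an ordered list of per-round joint vectors into running states.

    Hypothetical simplification: an element-wise tanh recurrence stands in
    for the LSTM; w_in and w_rec are made-up scalar weights.
    """
    state = [0.0] * len(round_vectors[0])
    states = []
    for vec in round_vectors:  # rounds are visited in dialog order
        state = [math.tanh(w_in * x + w_rec * h) for x, h in zip(vec, state)]
        states.append(state)
    return states


# Three dialog rounds, each already fused into a 2-d joint vector (toy values).
dialog = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
states = propagate(dialog)
```

The state for round t depends only on rounds 1 through t, so later rounds can draw on earlier history but not the reverse.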

Highlights

Our work is inspired by the use of visual-linguistic joint representation to erase the modality gap, where we embed the visual signals into the text snippets for each dialog round.
On the VisDial v1.0 test-std dataset, our best single generative SeqDialN achieves 62.54% NDCG and 48.63% MRR; our ensemble generative SeqDialN achieves 63.78% NDCG and 49.98% MRR, setting a new state of the art among generative visual dialog models.
The vision feature of image I is represented as $F_I \in \mathbb{R}^{n_v \times d_v}$, where $n_v = 36$ is the number of object-like region proposals in the [...]
[...] progressively update the semantic representations of query, vision, and history, making them co-aware through multiple steps to achieve multi-step reasoning.
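The NDCG and MRR figures quoted above are ranking metrics over candidate answers. A minimal sketch of their textbook definitions follows. The function names, the toy ranks, and the toy relevance scores are assumptions for illustration, and the benchmark's exact evaluation convention (for example, how the cutoff k is chosen) is not stated in this text.

```python
import math


def mean_reciprocal_rank(gt_ranks):
    """gt_ranks: 1-based rank the model gave the ground-truth answer, per round."""
    return sum(1.0 / r for r in gt_ranks) / len(gt_ranks)


def ndcg_at_k(relevance_in_ranked_order, k):
    """relevance_in_ranked_order[i] is the relevance of the candidate ranked i-th."""
    def dcg(rels):
        # Position 0 is discounted by log2(2) = 1, position 1 by log2(3), ...
        return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(rels))

    ideal = dcg(sorted(relevance_in_ranked_order, reverse=True)[:k])
    return dcg(relevance_in_ranked_order[:k]) / ideal if ideal > 0 else 0.0


# Toy inputs: ground truth ranked 1st, 2nd, and 4th in three rounds.
mrr = mean_reciprocal_rank([1, 2, 4])
# Toy inputs: the model ranked an irrelevant candidate first.
ndcg = ndcg_at_k([0.0, 1.0, 0.5, 0.0], k=2)
```

MRR rewards placing the single ground-truth answer near the top, while NDCG credits every candidate in proportion to its graded relevance, which is why the two numbers can move independently.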

Summary

Related Work

VQA focuses on providing a natural language answer given an image and a free-form, open-ended question. Attention mechanisms, which help refine semantic meaning at different levels, have been explored in depth in VQA-related work. Lu et al. (2016) introduce co-attention between image regions and words in the question. Dense Co-Attention Network (DCN) (Nguyen and Okatani, 2018) proposes a symmetric co-attention layer to address VQA tasks. DCN is "dense symmetric" because it makes each visual region aware of the existence of each question word and vice versa. This fine-grained co-attention enables DCN to discriminate subtle differences or similarities between vision and language features. We use DCN as the generator of the joint visual-linguistic representation.
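The "dense symmetric" idea described above, where every region attends over every word and every word over every region, can be sketched in a few lines. This is a simplified illustration, not DCN itself: it uses a raw dot-product affinity with no learned projections or multiple heads, and the names `co_attend`, `regions`, and `words` and the toy features are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def weighted_sum(weights, vectors):
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]


def co_attend(regions, words):
    """Return (language-aware regions, vision-aware words).

    Simplified assumption: dot-product affinity, no learned parameters.
    """
    affinity = [[dot(r, w) for w in words] for r in regions]  # regions x words
    # Each region attends over all words (softmax along a row).
    attended_words = [weighted_sum(softmax(row), words) for row in affinity]
    # Each word attends over all regions (softmax along a column).
    columns = [list(col) for col in zip(*affinity)]
    attended_regions = [weighted_sum(softmax(col), regions) for col in columns]
    # Concatenate every feature with its summary of the other modality.
    fused_regions = [r + a for r, a in zip(regions, attended_words)]
    fused_words = [w + a for w, a in zip(words, attended_regions)]
    return fused_regions, fused_words


regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 toy region features
words = [[1.0, 0.0], [0.0, 2.0]]                # 2 toy word features
fused_regions, fused_words = co_attend(regions, words)
```

The same affinity matrix is normalized in both directions, which is what makes the attention symmetric: rows give each region a distribution over words, columns give each word a distribution over regions.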

Visual Dialog

Approach

SeqIPN

SeqMRN

History Backward Self-Attention Layer

Query Correction Layer

Multi-step Reasoning

Decoder Module

Generative Decoder

Reweighting Method in Fine-tuning with Dense Annotations

Experiments

Model Comparison

Ensemble SeqDialN Analysis

Ablation Study

Fine-tuning with Dense Annotations

Effectiveness of Query Correction Layer

Qualitative Analysis

Conclusion


Publication Date: Jan 1, 2021
Citations: 2
License type: cc-by

