Abstract

The key challenge of the visual dialog task is how to fuse features from multimodal sources and extract relevant information from the dialog history to answer the current query. In this work, we formulate a visual dialog as an information flow in which each piece of information is encoded with the joint visual-linguistic representation of a single dialog round. Based on this formulation, we treat the visual dialog task as a sequence problem over ordered visual-linguistic vectors. For featurization, we use a Dense Symmetric Co-Attention network (Nguyen and Okatani, 2018) as a lightweight vision-language joint representation generator to fuse multimodal features (i.e., image and text), yielding better computation and data efficiency. For inference, we propose two Sequential Dialog Networks (SeqDialN): the first uses an LSTM (Hochreiter and Schmidhuber, 1997) for information propagation (IP), and the second uses a modified Transformer (Vaswani et al., 2017) for multi-step reasoning (MR). Our architecture separates the complexity of multimodal feature fusion from that of inference, which allows a simpler design of the inference engine. On the VisDial v1.0 test-std dataset, our best single generative SeqDialN achieves 62.54% NDCG and 48.63% MRR; our ensemble of generative SeqDialN models achieves 63.78% NDCG and 49.98% MRR, setting a new state of the art for generative visual dialog models. We fine-tune the discriminative SeqDialN with dense annotations and boost its performance to 72.41% NDCG and 55.11% MRR. We describe extensive experiments that demonstrate the effectiveness of our model components, visualize the reasoning process over the relevant conversation rounds, and discuss our fine-tuning methods. The code is available at https://github.com/xiaoxiaoheimei/SeqDialN.
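As a rough illustration of this separation of concerns, the sketch below assumes each dialog round has already been fused into a single joint visual-linguistic vector (e.g., by DCN) and runs sequence inference over the ordered vectors. The class name SeqDialNSketch, the variant flag, and the hidden size are illustrative placeholders, and a standard LSTM/Transformer encoder stands in for the paper's IP and MR modules; this is not the released implementation.

```python
# Minimal sketch, not the released SeqDialN code: inference is a sequence model
# over the ordered per-round joint visual-linguistic vectors.
import torch
import torch.nn as nn

class SeqDialNSketch(nn.Module):
    def __init__(self, d_model=512, variant="ip"):
        super().__init__()
        self.variant = variant
        if variant == "ip":   # information propagation: LSTM over dialog rounds
            self.seq = nn.LSTM(d_model, d_model, batch_first=True)
        else:                 # multi-step reasoning: a plain Transformer encoder here
            layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            self.seq = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, d_model)  # projection consumed by a decoder

    def forward(self, round_vectors):
        # round_vectors: (batch, num_rounds, d_model), one fused vector per round
        if self.variant == "ip":
            out, _ = self.seq(round_vectors)
        else:
            out = self.seq(round_vectors)
        return self.head(out[:, -1])  # representation of the current round

model = SeqDialNSketch(variant="mr")
current_round = model(torch.randn(2, 10, 512))  # 2 dialogs, 10 rounds each
```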

Highlights

  • Our work is inspired by the use of a visual-linguistic joint representation to erase the modality gap, where we embed the visual signals into the text snippets of each dialog round

  • On the VisDial v1.0 test-std dataset, our best single generative SeqDialN achieves 62.54% NDCG and 48.63% MRR; our ensemble of generative SeqDialN models achieves 63.78% NDCG and 49.98% MRR, setting a new state of the art for generative visual dialog models

  • The vision feature of image I is represented as F_I ∈ R^{n_v × d_v}, where n_v = 36 is the number of object-like region proposals in the image; our multi-step reasoning network progressively updates the semantic representations of the query, vision, and history, making them co-aware through multiple steps (see the sketch after this list)

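A minimal sketch of the co-aware update described in the last highlight, assuming region features F_I of shape (36, d_v) and hypothetical names (proj_v, coaware_step); a shared multi-head attention layer stands in for the paper's actual layers.

```python
# Hypothetical sketch of one co-aware reasoning step (not the paper's exact layers):
# the query, vision, and history states each attend over the other two.
import torch
import torch.nn as nn

d_v, n_v, d = 2048, 36, 512                      # region feature dim, #regions, hidden size
proj_v = nn.Linear(d_v, d)                       # project F_I into the shared space
attn = nn.MultiheadAttention(d, num_heads=8, batch_first=True)

def coaware_step(query, vision, history):
    # Refine each state with the other two; repeating this makes them mutually aware.
    q_new, _ = attn(query, torch.cat([vision, history], 1), torch.cat([vision, history], 1))
    v_new, _ = attn(vision, torch.cat([query, history], 1), torch.cat([query, history], 1))
    h_new, _ = attn(history, torch.cat([query, vision], 1), torch.cat([query, vision], 1))
    return q_new, v_new, h_new

F_I = torch.randn(1, n_v, d_v)                   # (batch, 36, d_v) region-proposal features
query = torch.randn(1, 20, d)                    # encoded current question tokens
history = torch.randn(1, 10, d)                  # one vector per previous dialog round
vision = proj_v(F_I)
for _ in range(3):                               # a few reasoning steps
    query, vision, history = coaware_step(query, vision, history)
```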

Summary

Related Work

VQA focuses on providing a natural language answer given an image and a free-form, open-ended question. Attention mechanisms have been deeply explored in VQA-related work, as they help refine semantic meanings at different levels. Lu et al. (2016) introduce co-attention between image regions and words in the question. The Dense Co-Attention Network (DCN) (Nguyen and Okatani, 2018) proposes a symmetric co-attention layer to address VQA tasks. DCN is "dense symmetric" because it makes each visual region aware of the existence of each question word and vice versa. This fine-grained co-attention enables DCN to discriminate subtle differences or similarities between vision and language features. We use DCN as the generator of joint visual-linguistic representations.
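To make the "dense symmetric" idea concrete, the sketch below computes a word-by-region affinity matrix and normalizes it in both directions, so every question word attends over all image regions and every region attends over all words. It is a simplified, single-layer illustration in the spirit of DCN's co-attention, not its exact layer; the actual network is more elaborate.

```python
# Simplified dense symmetric co-attention (illustrative, not DCN's exact layer).
import torch
import torch.nn.functional as F

def dense_co_attention(words, regions):
    # words:   (batch, n_words, d)   question word features
    # regions: (batch, n_regions, d) image region features
    affinity = torch.bmm(words, regions.transpose(1, 2))   # (batch, n_words, n_regions)
    # Each word attends over all regions...
    w2r = F.softmax(affinity, dim=2)
    words_ctx = torch.bmm(w2r, regions)                    # region-aware word features
    # ...and each region attends over all words (the symmetric direction).
    r2w = F.softmax(affinity, dim=1).transpose(1, 2)       # (batch, n_regions, n_words)
    regions_ctx = torch.bmm(r2w, words)                    # word-aware region features
    return words_ctx, regions_ctx

words_ctx, regions_ctx = dense_co_attention(torch.randn(2, 20, 512),
                                            torch.randn(2, 36, 512))
```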

Visual Dialog
Approach
SeqIPN
SeqMRN
History Backward Self-Attention Layer
Query Correction Layer
Multi-step Reasoning
Decoder Module
Generative Decoder
Reweighting Method in Fine-tuning with Dense Annotations
Experiments
Model Comparison
Ensemble SeqDialN Analysis
Ablation Study
Fine-tuning with Dense Annotations
Effectiveness of Query Correction Layer
Qualitative Analysis
Conclusion