Abstract

Visual dialog demonstrates several important aspects of multimodal artificial intelligence; however, it is hindered by visual grounding and visual coreference resolution problems. To overcome these problems, we propose a novel neural module network for visual dialog (NMN-VD). NMN-VD is an efficient question-customized modular network model that combines only the modules required to decide the answer after analyzing the input question. In particular, the model includes a Refer module that uses a reference pool to effectively find the visual area indicated by a pronoun, thereby addressing visual coreference resolution, an important challenge in visual dialog. In addition, the proposed NMN-VD model includes a method for distinguishing impersonal pronouns, which do not require visual coreference resolution, from general pronouns and handling them accordingly. Furthermore, the model includes a new Compare module that effectively handles the comparison questions found in visual dialogs, as well as a Find module that applies a triple-attention mechanism to solve visual grounding problems between the question and the image. The results of various experiments conducted on a large-scale benchmark dataset verify the efficacy and high performance of our proposed NMN-VD model.

Highlights

  • Recent developments in computer vision and natural language processing technologies have contributed to increasing interest in multimodal artificial intelligence, which involves the simultaneous understanding of images and language

  • We propose the novel neural module network for visual dialog (NMN-VD), which can lower the complexity of neural network structures and reduce the number of parameters to be learned by selectively composing the neural network modules required for the processing of each question

  • To demonstrate the efficacy of the proposed neural module network for visual dialog (NMN-VD) model, experiments were performed comparing its performance with that of existing visual dialog models


Summary

Introduction

Recent developments in computer vision and natural language processing technologies have contributed to increasing interest in multimodal artificial intelligence, which involves the simultaneous understanding of images and language. In this work (Sensors 2021, 21, 931), the neural module network (NMN) model for visual dialog [5] was adopted in place of a monolithic neural network model; it provides the basic neural network modules for visual dialog: Find, Refer, And, Relocate, Describe, and Compare. To solve the visual grounding problem, the proposed model adopts a Find module that applies a triple-attention mechanism to improve find performance, unlike the Find modules of existing models that use single- or double-attention mechanisms. To process comparison questions effectively, the proposed model contains a Compare module that determines the minimum bounding area that includes two object regions in the image and extracts a context for the comparison operation from this area.
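The bounding-area step of the Compare module can be illustrated with a short sketch. This is not the paper's implementation; the function names and the (x1, y1, x2, y2) box convention are assumptions chosen for clarity. It shows how the minimum axis-aligned box enclosing two object regions is computed, and how a comparison context could then be cropped from it.

```python
# Illustrative sketch (not the paper's code): deriving the minimum bounding
# area that encloses two object regions, as the Compare module does before
# extracting context for the comparison operation.
# Boxes are (x1, y1, x2, y2) in pixel coordinates; names are hypothetical.

def min_bounding_area(box_a, box_b):
    """Smallest axis-aligned box containing both input boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    return (min(ax1, bx1), min(ay1, by1), max(ax2, bx2), max(ay2, by2))

def crop_context(image, box):
    """Crop the comparison-context region from an H x W x C array.

    Slicing shown assumes a NumPy-style array.
    """
    x1, y1, x2, y2 = box
    return image[y1:y2, x1:x2]
```

For example, regions (10, 10, 50, 60) and (40, 30, 90, 80) yield the enclosing area (10, 10, 90, 80), from which the comparison context would be extracted.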

Related Work
Visual Dialog
Neural Module Network
Model Overview
The program executor extracts the visual features x from the input image
Neural Network Module
Refer Module
Processing Impersonal Pronouns
Datasets and Model Training
Performance Comparison with Existing Models
Conclusions