Abstract
Visual dialog demonstrates several important aspects of multimodal artificial intelligence; however, it is hindered by visual grounding and visual coreference resolution problems. To overcome these problems, we propose a novel neural module network for visual dialog (NMN-VD). NMN-VD is an efficient question-customized modular network model that combines only the modules required for deciding answers after analyzing input questions. In particular, the model includes a Refer module that effectively finds the visual area indicated by a pronoun using a reference pool, solving the visual coreference resolution problem, which is an important challenge in visual dialog. In addition, the proposed NMN-VD model includes a method for distinguishing impersonal pronouns, which do not require visual coreference resolution, from general pronouns, and handling them accordingly. Furthermore, the model includes a new Compare module that effectively handles the comparison questions found in visual dialogs, as well as a Find module that applies a triple-attention mechanism to solve visual grounding problems between the question and the image. The results of various experiments conducted using a large-scale benchmark dataset verify the efficacy and high performance of our proposed NMN-VD model.
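The reference-pool idea behind the Refer module can be illustrated with a minimal sketch. All names here (`ReferencePool`, `refer`, `add`) are hypothetical, and the paper's learned attention is replaced with a simple dot-product similarity over stored text embeddings: previously grounded entities are kept as (text embedding, visual attention map) pairs, and a pronoun is resolved by soft-attending over the pool.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

class ReferencePool:
    """Hypothetical reference pool: stores text embeddings of previously
    grounded entities alongside their visual attention maps."""

    def __init__(self):
        self.keys = []   # text embeddings of past referents
        self.maps = []   # corresponding visual attention maps

    def add(self, text_emb, attn_map):
        """Record an entity grounded in an earlier dialog round."""
        self.keys.append(text_emb)
        self.maps.append(attn_map)

    def refer(self, pronoun_emb):
        """Resolve a pronoun: score each stored referent by similarity
        to the pronoun embedding, then blend their attention maps."""
        scores = np.array([k @ pronoun_emb for k in self.keys])
        weights = softmax(scores)
        return sum(w * m for w, m in zip(weights, self.maps))
```

A pronoun embedding close to one stored referent yields an attention map dominated by that referent's map, which is the behavior the Refer module needs for coreference resolution.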
Highlights
Recent developments in computer vision and natural language processing technologies have contributed to increasing interest in multimodal artificial intelligence, which involves the simultaneous understanding of images and language
We propose the novel neural module network for visual dialog (NMN-VD), which can lower the complexity of neural network structures and reduce the number of parameters to be learned by selectively composing the neural network modules required for the processing of each question
To prove the efficacy of the neural module network for visual dialog (NMN-VD) model proposed in this study, experiments were performed comparing its performance with that of existing visual dialog models
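The selective composition described above can be sketched in a few lines. This is a hypothetical toy, not the paper's implementation: each module is a stub function, a parsed question yields a program (a module sequence), and only the modules that program names are assembled and executed.

```python
# Toy module stubs; each takes the running state and an argument.
def find(state, arg):
    """Locate an image region matching `arg` (stubbed)."""
    return state + [f"find({arg})"]

def refer(state, arg):
    """Resolve a pronoun against the dialog history (stubbed)."""
    return state + [f"refer({arg})"]

def describe(state, arg):
    """Decode an answer from the attended region (stubbed)."""
    return state + [f"describe({arg})"]

MODULES = {"Find": find, "Refer": refer, "Describe": describe}

def execute(program):
    """Run only the modules the parsed question requires."""
    state = []
    for name, arg in program:
        state = MODULES[name](state, arg)
    return state
```

For example, a question such as "What color is it?" might parse to `[("Refer", "it"), ("Describe", "color")]`, so the Find and Compare modules are never instantiated, which is what keeps the composed network small.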
Summary
Recent developments in computer vision and natural language processing technologies have contributed to increasing interest in multimodal artificial intelligence, which involves the simultaneous understanding of images and language. As natural language questions are diverse, the neural module network (NMN) model for visual dialog [5] was adopted (Sensors 2021, 21, 931). Unlike a monolithic neural network model, it provides the basic neural network modules for visual dialog: Find, Refer, And, Relocate, Describe, and Compare. To solve the visual grounding problem, the proposed model adopts a Find module that applies a triple-attention mechanism to improve grounding performance, unlike the Find modules of existing models, which use single- or double-attention mechanisms. Visual dialogs also contain comparison questions; to process them effectively, the proposed model contains a Compare module that determines the minimum bounding area enclosing two object regions in the image and extracts a context for the comparison operation from that area.
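The geometric step of the Compare module, finding the minimum bounding area that encloses two object regions, reduces to taking the smallest axis-aligned box containing both boxes. A minimal sketch, with the function name and box convention being our own assumptions:

```python
def min_bounding_box(box_a, box_b):
    """Smallest axis-aligned box containing both object boxes.

    Boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2; the union box
    takes the componentwise minima of the top-left corners and maxima
    of the bottom-right corners.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    return (min(ax1, bx1), min(ay1, by1), max(ax2, bx2), max(ay2, by2))
```

Image features cropped from this union box then cover both objects and the space between them, providing the shared context from which the comparison is decided.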