Abstract

The multimodal models used in the emerging field at the intersection of computational linguistics and computer vision implement the bottom-up processing of the “Hub and Spoke” architecture proposed in cognitive science to represent how the brain processes and combines multi-sensory inputs. In particular, the Hub is implemented as a neural network encoder. We investigate how various vision-and-language tasks proposed in the literature affect this encoder: visual question answering, visual reference resolution, and visually grounded dialogue. To measure the quality of the representations learned by the encoder, we use two kinds of analyses. First, we evaluate encoders pre-trained on the different vision-and-language tasks on an existing “diagnostic task” designed to assess multimodal semantic understanding. Second, we carry out a battery of analyses aimed at studying how the encoder merges and exploits the two modalities.

Highlights

  • In recent years, the emerging field at the intersection of computational linguistics and computer vision has made substantial progress thanks to the use of deep neural networks

  • We consider three visually grounded tasks: visual question answering (VQA) (Antol et al., 2015), where the encoder is trained to answer a question about an image; visual resolution of referring expressions (ReferIt) (Kazemzadeh et al., 2014), where the model has to pick out the referent of a description in an image; and GuessWhat (de Vries et al., 2017), where the model has to identify the object in an image that is the target of a goal-oriented question-answer dialogue

  • We carry out two main blocks of analyses: one exploiting FOIL as a diagnostic task, and the other investigating the structure of the semantic spaces produced by the pre-trained encoders when they receive the same multimodal inputs (one way to compare such spaces is sketched after this list)
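
One way to make the second block of analyses concrete is an RSA-style comparison: feed the same inputs to two pre-trained encoders, build a pairwise cosine-similarity matrix for each resulting semantic space, and correlate the two matrices. The Python sketch below uses random arrays in place of real encoder outputs; it illustrates the style of analysis, not the paper's exact procedure.

    import numpy as np
    from scipy.stats import spearmanr

    def cosine_sim_matrix(reps):
        """Pairwise cosine similarities between all row vectors."""
        normed = reps / np.linalg.norm(reps, axis=1, keepdims=True)
        return normed @ normed.T

    rng = np.random.default_rng(0)
    reps_vqa = rng.normal(size=(100, 512))      # stand-in: encoder pre-trained on VQA
    reps_referit = rng.normal(size=(100, 512))  # stand-in: encoder pre-trained on ReferIt

    sim_vqa = cosine_sim_matrix(reps_vqa)
    sim_referit = cosine_sim_matrix(reps_referit)

    # Correlate the upper triangles: a high correlation means the two encoders
    # impose a similar neighborhood structure on the same multimodal inputs.
    iu = np.triu_indices(100, k=1)
    rho, _ = spearmanr(sim_vqa[iu], sim_referit[iu])
    print(f"Spearman correlation between similarity structures: {rho:.3f}")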


Summary

Introduction

A lot of progress has been made within the emerging field at the intersection of computational linguistics and computer vision thanks to the use of deep neural networks. The benchmarks developed so far have put forward complex and distinct neural architectures, but in general they all share a common backbone: an encoder that learns to merge the two types of representation to perform a certain task (a minimal sketch of this backbone follows below). This resembles the bottom-up processing in the ‘Hub and Spoke’ model proposed in Cognitive Science to represent how the brain processes and combines multi-sensory inputs (Patterson and Ralph, 2015). We focus our attention on the encoder implementing the ‘hub’ in artificial multimodal systems, with the goal of assessing its ability to compute multimodal representations that are useful beyond specific tasks.
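
The following PyTorch sketch shows what such a shared backbone might look like: two modality-specific projections (the spokes) feed a joint encoder (the hub), and lightweight task-specific heads reuse the same multimodal representation. All names, dimensions, and the concatenation-based fusion are illustrative assumptions, not the exact architecture evaluated in the paper.

    import torch
    import torch.nn as nn

    class HubEncoder(nn.Module):
        """Merges a visual and a linguistic feature vector into one multimodal representation."""

        def __init__(self, img_dim=2048, txt_dim=512, hidden_dim=512):
            super().__init__()
            self.img_proj = nn.Linear(img_dim, hidden_dim)  # spoke: vision
            self.txt_proj = nn.Linear(txt_dim, hidden_dim)  # spoke: language
            self.fuse = nn.Sequential(                      # hub: joint encoder
                nn.Linear(2 * hidden_dim, hidden_dim),
                nn.ReLU(),
            )

        def forward(self, img_feats, txt_feats):
            joint = torch.cat([self.img_proj(img_feats),
                               self.txt_proj(txt_feats)], dim=-1)
            return self.fuse(joint)  # the multimodal representation under study

    # Task-specific heads share the hub, so pre-training on VQA, ReferIt,
    # or GuessWhat shapes the same multimodal space.
    hub = HubEncoder()
    vqa_head = nn.Linear(512, 1000)   # e.g. classify over candidate answers
    referit_head = nn.Linear(512, 1)  # e.g. score a candidate object region

    img = torch.randn(4, 2048)  # batch of pooled CNN image features
    txt = torch.randn(4, 512)   # batch of sentence encodings
    rep = hub(img, txt)         # shared representation reused by all heads
    print(vqa_head(rep).shape)  # torch.Size([4, 1000])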


