Abstract

Referring expression comprehension, the ability to identify the object in an image that a linguistic expression refers to, plays an important role in creating common ground. Many models that fuse visual and linguistic features have been proposed. However, few models consider fusing linguistic features with multiple visual features that have different receptive field sizes, even though the appropriate receptive field size intuitively varies with the expression. In this paper, we introduce a neural network architecture that uses linguistic features to modulate visual features with varying receptive field sizes. We evaluate our architecture on tasks related to referring expression comprehension in two visual dialogue games. The results show the advantages and broad applicability of our architecture. Source code is available at https://github.com/Alab-NII/lcfp .
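The core idea of the abstract, conditioning multi-scale visual features on a language representation, can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's actual architecture: the FiLM-style scale-and-shift form, the pyramid sizes, and all parameter shapes are assumptions made for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

def modulate(feature_map, lang_vec, w_gamma, w_beta):
    """FiLM-style modulation: scale and shift each channel of a
    visual feature map using language-derived coefficients."""
    gamma = lang_vec @ w_gamma  # per-channel scale, shape (C,)
    beta = lang_vec @ w_beta    # per-channel shift, shape (C,)
    return gamma[None, None, :] * feature_map + beta[None, None, :]

C, D = 8, 16  # hypothetical channel count and language embedding size

# A feature pyramid: smaller maps correspond to larger receptive fields.
pyramid = [rng.normal(size=(s, s, C)) for s in (32, 16, 8)]
lang = rng.normal(size=D)  # stand-in for an encoded referring expression

# Each pyramid level gets its own modulation parameters, so the language
# signal can weight coarse and fine receptive fields differently.
params = [(rng.normal(size=(D, C)), rng.normal(size=(D, C))) for _ in pyramid]
fused = [modulate(f, lang, wg, wb) for f, (wg, wb) in zip(pyramid, params)]

print([f.shape for f in fused])  # spatial structure is preserved at every scale
```

In a real model the modulated maps would feed a prediction head that scores candidate objects; here the point is only that the same linguistic vector conditions every level of the pyramid through level-specific parameters.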

Highlights

  • Referring expressions are a ubiquitous part of human communication (Krahmer and Van Deemter, 2012) that must be studied in order to create machines that work smoothly with humans

  • We introduce a neural network architecture for referring expression comprehension that considers visual features with different receptive field sizes, and evaluate it on the OneCommon task

  • To confirm the broad applicability of our architecture, we further evaluate it on another task, which, because it uses photographs, is expected to demand object category recognition more strongly than OneCommon does


Summary

Introduction

Referring expressions are a ubiquitous part of human communication (Krahmer and Van Deemter, 2012) that must be studied in order to create machines that work smoothly with humans. Much effort has been devoted to improving methods of creating visual common ground between humans and machines, which have limited means of expression and limited knowledge of the real world, from the perspectives of both referring expression comprehension and generation (Moratz et al., 2002; Tenbrink and Moratz, 2003; Funakoshi et al., 2004, 2005, 2006; Fang et al., 2013). Many models have been proposed for referring expression comprehension. As image recognition matured, Guadarrama et al. (2014) studied object retrieval methods based on category labels predicted by recognition models. More recently, models that fuse linguistic features with visual features using deep learning have been studied (Hu et al., 2016b,a; Anderson et al., 2018; Deng et al., 2018; Misra et al., 2018; Li et al., 2018; Yang et al., 2019a,b; Liu et al., 2019; Can et al., 2020).


