Abstract

Spatial understanding is crucial in many real-world problems, yet little progress has been made towards building representations that capture spatial knowledge. Here, we move one step forward in this direction and learn such representations by leveraging a task that consists of predicting the continuous 2D spatial arrangement of objects given object-relationship-object instances (e.g., “cat under chair”), together with a simple neural network model that learns the task from annotated images. We show that the model succeeds in this task and, furthermore, that it is capable of predicting correct spatial arrangements for unseen objects if either convolutional neural network (CNN) features or word embeddings of the objects are provided. The differences between visual and linguistic features are discussed. Next, to evaluate the spatial representations learned in the previous task, we introduce a task and a dataset consisting of crowdsourced human ratings of spatial similarity for object pairs. We find that both CNN features and word embeddings predict human judgments of similarity well, and that these vectors can be further specialized in spatial knowledge if we update them while training the model that predicts spatial arrangements of objects. Overall, this paper paves the way towards building distributed spatial representations, contributing to the understanding of spatial expressions in language.
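
As a loose illustration of the arrangement-prediction task (not the authors' actual architecture), the sketch below assumes a simple embedding-based model that maps a (subject, relationship, object) triple to a continuous 2D offset of the object relative to the subject and is trained with a mean-squared-error loss; the vocabulary sizes, dimensions, and the placeholder batch are hypothetical.

```python
# Minimal, illustrative sketch of an embedding-based spatial-arrangement predictor.
import torch
import torch.nn as nn

class SpatialArrangementModel(nn.Module):
    def __init__(self, n_objects, n_relations, obj_dim=300, rel_dim=50, hidden=100):
        super().__init__()
        # Learned object/relationship embeddings; the object table could be replaced
        # by frozen CNN features or word embeddings to handle unseen objects.
        self.obj_emb = nn.Embedding(n_objects, obj_dim)
        self.rel_emb = nn.Embedding(n_relations, rel_dim)
        self.mlp = nn.Sequential(
            nn.Linear(2 * obj_dim + rel_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),  # continuous 2D arrangement (x, y) of object w.r.t. subject
        )

    def forward(self, subj, rel, obj):
        x = torch.cat([self.obj_emb(subj), self.rel_emb(rel), self.obj_emb(obj)], dim=-1)
        return self.mlp(x)

# Hypothetical training step on (subject, relationship, object) index triples
# with 2D targets extracted from annotated images.
model = SpatialArrangementModel(n_objects=1000, n_relations=50)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

subj = torch.randint(0, 1000, (32,))   # placeholder batch of subject indices
rel = torch.randint(0, 50, (32,))      # placeholder relationship indices
obj = torch.randint(0, 1000, (32,))    # placeholder object indices
targets = torch.randn(32, 2)           # placeholder 2D arrangements

optimizer.zero_grad()
loss = loss_fn(model(subj, rel, obj), targets)
loss.backward()
optimizer.step()
```

Feeding fixed CNN features or pretrained word embeddings in place of the learned object table corresponds to the zero-shot setting mentioned above, where arrangements are predicted for objects never seen during training.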

Highlights

  • Representing spatial knowledge is instrumental in any task involving text-to-scene conversion, such as robot understanding of natural language commands (Guadarrama et al., 2013; Moratz and Tenbrink, 2006) or a number of robot navigation tasks.

  • To evaluate the quality of the spatial representations learned in the previous task, we introduce a task consisting of a set of 1,016 human ratings of spatial similarity between object pairs.

  • To learn spatial representations, we have leveraged the task of predicting the continuous 2D relative spatial arrangement of two objects under a relationship, and a simple embedding-based neural model that learns this task from annotated images.

Introduction

Representing spatial knowledge is instrumental in any task involving text-to-scene conversion, such as robot understanding of natural language commands (Guadarrama et al., 2013; Moratz and Tenbrink, 2006) or a number of robot navigation tasks. One may reasonably expect that the more attributes two objects share (e.g., size, functionality, etc.), the more likely they are to exhibit similar spatial arrangements with respect to other objects. Leveraging this intuition, we foresee that visual and linguistic representations can be spatially informative about unseen objects, as they encode features/attributes of objects (Collell and Moens, 2016). In this paper we systematically study how informative visual and linguistic features, in the form of convolutional neural network (CNN) features and word embeddings, are about the spatial behavior of objects.
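
One simple way to probe how spatially informative such feature vectors are, loosely in the spirit of the similarity evaluation described in the abstract, is to correlate the cosine similarity of object feature vectors with crowdsourced ratings of spatial similarity. The sketch below is only illustrative: the `features` dictionary and `human_ratings` list are placeholders, not the actual CNN features, word embeddings, or the 1,016-rating dataset.

```python
# Sketch: correlate feature-vector similarity with human spatial-similarity ratings.
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Placeholder data: one feature vector per object (e.g., CNN features or word
# embeddings) and crowdsourced spatial-similarity ratings for object pairs.
rng = np.random.default_rng(0)
features = {name: rng.normal(size=300) for name in ["cat", "dog", "chair", "table", "lamp"]}
human_ratings = [
    ("cat", "dog", 4.5),
    ("chair", "table", 4.0),
    ("cat", "table", 1.5),
    ("dog", "chair", 1.5),
    ("lamp", "table", 3.0),
]

model_scores = [cosine(features[a], features[b]) for a, b, _ in human_ratings]
human_scores = [score for _, _, score in human_ratings]

rho, p = spearmanr(model_scores, human_scores)
print(f"Spearman rho = {rho:.3f} (p = {p:.3f})")
```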
