Abstract

Spatial understanding is crucial in many real-world problems, yet little progress has been made towards building representations that capture spatial knowledge. Here, we move one step forward in this direction and learn such representations by leveraging a task that consists of predicting the continuous 2D spatial arrangement of objects given object-relationship-object instances (e.g., “cat under chair”), together with a simple neural network model that learns the task from annotated images. We show that the model succeeds in this task and, furthermore, that it is capable of predicting correct spatial arrangements for unseen objects if either convolutional neural network (CNN) features or word embeddings of the objects are provided. The differences between visual and linguistic features are discussed. Next, to evaluate the spatial representations learned in the previous task, we introduce a task and a dataset consisting of crowdsourced human ratings of spatial similarity for object pairs. We find that both CNN features and word embeddings predict human judgments of similarity well, and that these vectors can be further specialized in spatial knowledge if we update them while training the model that predicts spatial arrangements of objects. Overall, this paper paves the way towards building distributed spatial representations, contributing to the understanding of spatial expressions in language.
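
As a loose illustration of the arrangement-prediction task (not the authors' actual architecture), the sketch below assumes a simple embedding-based model that maps a (subject, relationship, object) triple to a continuous 2D offset of the object relative to the subject and is trained with a mean-squared-error loss; the vocabulary sizes, dimensions, and the placeholder batch are hypothetical.

```python
# Minimal, illustrative sketch of an embedding-based spatial-arrangement predictor.
import torch
import torch.nn as nn

class SpatialArrangementModel(nn.Module):
    def __init__(self, n_objects, n_relations, obj_dim=300, rel_dim=50, hidden=100):
        super().__init__()
        # Learned object/relationship embeddings; the object table could be replaced
        # by frozen CNN features or word embeddings to handle unseen objects.
        self.obj_emb = nn.Embedding(n_objects, obj_dim)
        self.rel_emb = nn.Embedding(n_relations, rel_dim)
        self.mlp = nn.Sequential(
            nn.Linear(2 * obj_dim + rel_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),  # continuous 2D arrangement (x, y) of object w.r.t. subject
        )

    def forward(self, subj, rel, obj):
        x = torch.cat([self.obj_emb(subj), self.rel_emb(rel), self.obj_emb(obj)], dim=-1)
        return self.mlp(x)

# Hypothetical training step on (subject, relationship, object) index triples
# with 2D targets extracted from annotated images.
model = SpatialArrangementModel(n_objects=1000, n_relations=50)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

subj = torch.randint(0, 1000, (32,))   # placeholder batch of subject indices
rel = torch.randint(0, 50, (32,))      # placeholder relationship indices
obj = torch.randint(0, 1000, (32,))    # placeholder object indices
targets = torch.randn(32, 2)           # placeholder 2D arrangements

optimizer.zero_grad()
loss = loss_fn(model(subj, rel, obj), targets)
loss.backward()
optimizer.step()
```

Feeding fixed CNN features or pretrained word embeddings in place of the learned object table corresponds to the zero-shot setting mentioned above, where arrangements are predicted for objects never seen during training.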

Highlights

  • Representing spatial knowledge is instrumental in any task involving text-to-scene conversion, such as robot understanding of natural language commands (Guadarrama et al., 2013; Moratz and Tenbrink, 2006) or a number of robot navigation tasks.

  • To evaluate the quality of the spatial representations learned in the previous task, we introduce a task consisting of a set of 1,016 human ratings of spatial similarity between object pairs.

  • To learn spatial representations, we have leveraged the task of predicting the continuous 2D relative spatial arrangement of two objects under a relationship, and a simple embedding-based neural model that learns this task from annotated images.

Introduction

Representing spatial knowledge is instrumental in any task involving text-to-scene conversion, such as robot understanding of natural language commands (Guadarrama et al., 2013; Moratz and Tenbrink, 2006) or a number of robot navigation tasks. One may reasonably expect that the more attributes two objects share (e.g., size, functionality, etc.), the more likely they are to exhibit similar spatial arrangements with respect to other objects. Leveraging this intuition, we foresee that visual and linguistic representations can be spatially informative about unseen objects, as they encode features/attributes of objects (Collell and Moens, 2016). In this paper we systematically study how informative visual and linguistic features, in the form of convolutional neural network (CNN) features and word embeddings, are about the spatial behavior of objects.
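
One simple way to probe how spatially informative such feature vectors are, loosely in the spirit of the similarity evaluation described in the abstract, is to correlate the cosine similarity of object feature vectors with crowdsourced ratings of spatial similarity. The sketch below is only illustrative: the `features` dictionary and `human_ratings` list are placeholders, not the actual CNN features, word embeddings, or the 1,016-rating dataset.

```python
# Sketch: correlate feature-vector similarity with human spatial-similarity ratings.
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Placeholder data: one feature vector per object (e.g., CNN features or word
# embeddings) and crowdsourced spatial-similarity ratings for object pairs.
rng = np.random.default_rng(0)
features = {name: rng.normal(size=300) for name in ["cat", "dog", "chair", "table", "lamp"]}
human_ratings = [
    ("cat", "dog", 4.5),
    ("chair", "table", 4.0),
    ("cat", "table", 1.5),
    ("dog", "chair", 1.5),
    ("lamp", "table", 3.0),
]

model_scores = [cosine(features[a], features[b]) for a, b, _ in human_ratings]
human_scores = [score for _, _, score in human_ratings]

rho, p = spearmanr(model_scores, human_scores)
print(f"Spearman rho = {rho:.3f} (p = {p:.3f})")
```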
