Points, Paths, and Playscapes: Large-scale Spatial Language Understanding Tasks Set in the Real World

Jason Baldridge, Bo Pang, Tania Bedrax-Weiss, Daphne Luong, Srini Narayanan, Radu Soricut, Fernando Pereira, Michael Tseng, Yuan Zhang

doi:10.18653/v1/w18-1406

Abstract

Spatial language understanding is important for practical applications and as a building block for better abstract language understanding. Much progress has been made through work on understanding spatial relations and values in images and texts as well as on giving and following navigation instructions in restricted domains. We argue that the next big advances in spatial language understanding can be best supported by creating large-scale datasets that focus on points and paths based in the real world, and then extending these to create online, persistent playscapes that mix human and bot players, where the bot players must learn, evolve, and survive according to their depth of understanding of scenes, navigation, and interactions.

Highlights

Language is not sealed in a textual medium disconnected from the world
Mental simulation involving motor and perceptual content likely plays a crucial role in sentence comprehension (Bergen et al., 2010)
We argue that the big advances in spatial language understanding can be best enabled by first creating large-scale datasets that require spatial understanding of real-world points and paths, and building on these to create persistent, online playscapes that enable both automated agents and people to interact in virtual and augmented reality environments

Summary

Introduction

Language is not sealed in a textual medium disconnected from the world. People use language to talk about people, places, and things that exist in both time and space. Geospatial mapping applications (such as Google Maps) provide algorithmic, route-based instruction at a global scale. They rely on explicitly named roads, paths, and addresses, and they assume a large database as a model of the world, which includes mappings between names and geolocations. Such systems give instructions but cannot interpret them, much less interact with a human user. Scene understanding, in both images and texts, is needed at both ends of this scale and in between. We expect that such a project provides challenges of high complexity, while linking in to rich, already-available resources that connect both text and images to each other and to key metadata, including coordinates in both space and time.

Data and annotation

Task considerations

Conclusion