Abstract
The role of robots in society keeps expanding, bringing with it the necessity of interacting and communicating with humans. To keep such interaction intuitive, we provide automatic wayfinding based on verbal navigational instructions. Our first contribution is the creation of a large-scale dataset of verbal navigation instructions. To this end, we developed an interactive visual navigation environment based on Google Street View; we further designed an annotation method that highlights mined anchor landmarks and the local directions between them, in order to help annotators formulate typical, human references to those. The annotation task was crowdsourced on the AMT platform to construct the new Talk2Nav dataset with 10,714 routes. Our second contribution is a new learning method. Inspired by spatial cognition research on the mental conceptualization of navigational instructions, we introduce a soft dual attention mechanism defined over the segmented language instructions to jointly extract two partial instructions—one for matching the next upcoming visual landmark and the other for matching the local directions to that landmark. Along similar lines, we also introduce a spatial memory scheme to encode the local directional transitions. Our work takes advantage of advances in two lines of research: the mental formalization of verbal navigational instructions and the training of neural network agents for automatic wayfinding. Extensive experiments show that our method significantly outperforms previous navigation methods. For the demo video, dataset and code, please refer to our project page.
Highlights
Consider that you are traveling as a tourist in a new city and are looking for a destination that you would like to visit.
Inspired by research on the mental conceptualization of navigational instructions in spatial cognition (Tversky and Lee 1999; Michon and Denis 2001; Klippel and Winter 2005), we introduce a soft attention mechanism defined over the segmented language instructions to jointly extract two partial instructions—one for matching the upcoming visual landmark and the other for matching the spatial transition to that landmark.
SPL↑ is used as the metric; bold numbers in the tables signify that the corresponding row/method gives the best performance among all compared methods.
The segmentation of the whole navigation instruction into landmark descriptions and local directional instructions, the attention map defined over language segments instead of individual English words, and the two clearly purposed matching modules make our method suitable for long-range vision-and-language navigation.
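The dual attention described above can be illustrated with a minimal sketch. The function and variable names below are hypothetical, not taken from the paper's code: given embeddings of the instruction segments, two separate query vectors (one for landmark matching, one for directional matching) each produce a softmax weighting over segments, yielding two attended sub-instruction summaries.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - x.max())
    return e / e.sum()

def dual_soft_attention(segments, q_landmark, q_direction):
    """Hypothetical sketch of soft dual attention over instruction segments.

    segments: (n, d) array of segment embeddings
    q_landmark, q_direction: (d,) query vectors for the two sub-tasks
    Returns two (d,) attended summaries, one per partial instruction.
    """
    w_lm = softmax(segments @ q_landmark)    # attention weights for landmark matching
    w_dir = softmax(segments @ q_direction)  # attention weights for direction matching
    return w_lm @ segments, w_dir @ segments

# toy example: 4 instruction segments with 3-dimensional embeddings
rng = np.random.default_rng(0)
segs = rng.normal(size=(4, 3))
v_lm, v_dir = dual_soft_attention(segs, rng.normal(size=3), rng.normal(size=3))
```

In the actual model, the queries would be computed from the agent's state and the attended vectors fed to the landmark- and direction-matching modules; this sketch only shows how one attention map per sub-task is derived from the same segment sequence.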
Summary
Consider that you are traveling as a tourist in a new city and are looking for a destination that you would like to visit. There is only one other work, by Chen et al. (2019), on natural-language-based outdoor navigation, which proposes an outdoor VLN dataset. They designed an elegant data annotation method through gaming, namely finding a hidden object at the goal position, but the method is difficult to apply to longer routes. We develop an interactive visual navigation environment based on Google Street View and, more importantly, design a novel annotation method which highlights selected landmarks and the spatial transitions in between. This enhanced annotation method makes it feasible to crowdsource this complicated annotation task. The second challenge lies in training a long-range wayfinding agent. This learning task requires accurate visual attention and language attention, accurate self-localization, and a good sense of direction towards the goal.