Abstract
The role of robots in society keeps expanding, bringing with it the necessity of interacting and communicating with humans. To keep such interaction intuitive, we provide automatic wayfinding based on verbal navigational instructions. Our first contribution is the creation of a large-scale dataset of verbal navigation instructions. To this end, we developed an interactive visual navigation environment based on Google Street View; we further designed an annotation method that highlights mined anchor landmarks and the local directions between them, in order to help annotators formulate typical, human references to those. The annotation task was crowdsourced on the AMT platform to construct the new Talk2Nav dataset with 10,714 routes. Our second contribution is a new learning method. Inspired by spatial cognition research on the mental conceptualization of navigational instructions, we introduce a soft dual attention mechanism defined over the segmented language instructions to jointly extract two partial instructions—one for matching the next upcoming visual landmark and the other for matching the local directions to that landmark. Along similar lines, we also introduce a spatial memory scheme to encode the local directional transitions. Our work takes advantage of advances in two lines of research: the mental formalization of verbal navigational instructions and the training of neural network agents for automatic wayfinding. Extensive experiments show that our method significantly outperforms previous navigation methods. For the demo video, dataset and code, please refer to our project page.
Highlights
Consider that you are traveling as a tourist in a new city and are looking for a destination that you would like to visit.
Inspired by research on the mental conceptualization of navigational instructions in spatial cognition (Tversky and Lee 1999; Michon and Denis 2001; Klippel and Winter 2005), we introduce a soft attention mechanism defined over the segmented language instructions to jointly extract two partial instructions—one for matching the upcoming visual landmark and the other for matching the spatial transition to that landmark.
SPL↑ is used as the metric; bold numbers in the tables signify that the corresponding row/method gives the best performance among all compared methods.
The segmentation of the whole navigation instruction into landmark descriptions and local directional instructions, the attention map defined over language segments instead of individual English words, and the two clearly purposed matching modules make our method suitable for long-range vision-and-language navigation.
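The dual attention described above can be illustrated with a minimal sketch. The function and variable names below are hypothetical, not taken from the paper's code: given embeddings of the instruction segments, two separate query vectors (one for landmark matching, one for directional matching) each produce a softmax weighting over segments, yielding two attended sub-instruction summaries.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - x.max())
    return e / e.sum()

def dual_soft_attention(segments, q_landmark, q_direction):
    """Hypothetical sketch of soft dual attention over instruction segments.

    segments: (n, d) array of segment embeddings
    q_landmark, q_direction: (d,) query vectors for the two sub-tasks
    Returns two (d,) attended summaries, one per partial instruction.
    """
    w_lm = softmax(segments @ q_landmark)    # attention weights for landmark matching
    w_dir = softmax(segments @ q_direction)  # attention weights for direction matching
    return w_lm @ segments, w_dir @ segments

# toy example: 4 instruction segments with 3-dimensional embeddings
rng = np.random.default_rng(0)
segs = rng.normal(size=(4, 3))
v_lm, v_dir = dual_soft_attention(segs, rng.normal(size=3), rng.normal(size=3))
```

In the actual model, the queries would be computed from the agent's state and the attended vectors fed to the landmark- and direction-matching modules; this sketch only shows how one attention map per sub-task is derived from the same segment sequence.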
Summary
Consider that you are traveling as a tourist in a new city and are looking for a destination that you would like to visit. There is only one other work, by Chen et al. (2019), on natural-language-based outdoor navigation, which proposes an outdoor VLN dataset. They designed an elegant data annotation method through gaming, namely finding a hidden object at the goal position, but the method is difficult to apply to longer routes. We develop an interactive visual navigation environment based on Google Street View and, more importantly, design a novel annotation method which highlights selected landmarks and the spatial transitions in between. This enhanced annotation method makes it feasible to crowdsource this complicated annotation task. The second challenge lies in training a long-range wayfinding agent. This learning task requires accurate visual attention and language attention, accurate self-localization, and a good sense of direction towards the goal.