Abstract

Vision-and-Language Navigation (VLN) is a task designed to enable embodied agents to carry out natural language instructions in realistic environments. Most VLN tasks, however, are guided by an elaborate set of step-by-step instructions. This setup deviates from real-world scenarios, in which humans describe only the target object and its surroundings and allow the robot to ask for help when required. Vision-based Navigation with Language-based Assistance (VNLA) is a recently proposed task that requires an agent to navigate to and find a target object according to a high-level language instruction. Because step-by-step navigation guidance is absent, the key to VNLA is goal-oriented exploration. In this paper, we design an Attention-based Knowledge-enabled Cross-modality Reasoning with Assistant's Help (AKCR-AH) model to address the unique challenges of this task. AKCR-AH learns a generalized navigation strategy from three new perspectives: (1) external commonsense knowledge is incorporated into visual relational reasoning, so that the agent can take a proper action at each viewpoint by learning the internal–external correlations among object and room entities; (2) a simulated human assistant is introduced into the environment to provide direct intervention when required; (3) a memory-based Transformer architecture is adopted as the policy framework to make full use of the history clues stored in memory tokens during exploration. Extensive experiments demonstrate the effectiveness of our method compared with other baselines.
