Abstract

Vision-and-language navigation is a newly emerging research topic that has developed rapidly in recent years, and it is one of the representative tasks at the frontier of vision-language interaction. The goal of the task is to achieve autonomous navigation based on visual perception of the environment, following language instructions given by humans. This paper reviews recent progress in vision-and-language navigation. First, the research content of the task is introduced, and its three main problems and challenges are analyzed: cross-modal semantic alignment, semantic understanding and reasoning, and improving generalization ability. Second, commonly used datasets and evaluation metrics are listed. Third, research progress on the task is summarized from four aspects: imitation learning, reinforcement learning, self-supervised learning, and other methods; the performance of typical solutions is carefully compared and analyzed. Fourth, current research trends of the task are discussed, mainly including navigation in continuous environments, comprehension of advanced complex instructions, and commonsense reasoning. Finally, future development directions such as 3D vision-and-language navigation, embodied question answering, and interactive question answering are further discussed and prospected.
