Abstract

Vision-and-language navigation (VLN) is a challenging task that requires an agent to navigate an indoor environment using natural language instructions. Traditional VLN employs cross-modal feature fusion, where visual and textual information are combined to guide the agent's navigation. However, incomplete use of perceptual information, scarcity of domain-specific training data, and diverse image and language inputs result in suboptimal performance. Herein, we propose a cross-modal feature fusion VLN method with history-aware information, which leverages an agent's past experiences to make more informed navigation decisions. Regretful and self-monitoring modules are incorporated, and the advantage actor-critic (A2C) reinforcement learning algorithm is employed to improve the navigation success rate, reduce action redundancy, and shorten navigation paths. Subsequently, a data augmentation method based on speaker data is introduced to improve model generalizability. We evaluate the proposed algorithm on the room-to-room (R2R) and room-for-room (R4R) benchmarks, and the experimental results demonstrate that the proposed algorithm outperforms state-of-the-art methods.
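The abstract names A2C only at a high level; the snippet below is a minimal, hedged sketch of a standard A2C objective (discounted returns, advantage-weighted policy loss, and a critic value loss) as it is commonly applied to episodic navigation. All names here (`a2c_loss`, `value_coef`, `entropy_coef`) are illustrative assumptions and not taken from the paper.

```python
import torch
import torch.nn.functional as F


def a2c_loss(log_probs, values, rewards, entropies=None,
             gamma=0.99, value_coef=0.5, entropy_coef=0.01):
    """Simplified A2C loss for one navigation episode (illustrative sketch).

    log_probs: (T,) log-probabilities of the actions the agent took
    values:    (T,) critic estimates V(s_t)
    rewards:   (T,) per-step rewards (e.g., progress toward the goal)
    """
    T = rewards.shape[0]
    returns = torch.zeros(T)
    running = 0.0
    # Discounted return G_t, computed backwards over the episode.
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running
        returns[t] = running

    # Advantage A_t = G_t - V(s_t); detach so only the value loss trains the critic.
    advantages = returns - values.detach()
    policy_loss = -(log_probs * advantages).mean()
    value_loss = F.mse_loss(values, returns)

    loss = policy_loss + value_coef * value_loss
    if entropies is not None:
        # Optional entropy bonus to discourage premature action collapse.
        loss = loss - entropy_coef * entropies.mean()
    return loss
```

In practice, the policy and critic heads would share the history-aware cross-modal encoder, and the reward could combine goal distance reduction with a success bonus; those design details are assumptions here, not specifics stated in the abstract.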
