Abstract

Vision-and-language navigation (VLN) is a challenging task that requires an agent to navigate an indoor environment using natural language instructions. Traditional VLN employs cross-modal feature fusion, where visual and textual information are combined to guide the agent's navigation. However, incomplete use of perceptual information, scarcity of domain-specific training data, and diverse image and language inputs result in suboptimal performance. Herein, we propose a cross-modal feature fusion VLN method with history-aware information, which leverages an agent's past experiences to make more informed navigation decisions. Regretful and self-monitoring modules are incorporated, and the advantage actor-critic (A2C) reinforcement learning algorithm is employed to improve the navigation success rate, reduce action redundancy, and shorten navigation paths. Subsequently, a data augmentation method based on speaker data is introduced to improve model generalizability. We evaluate the proposed algorithm on the room-to-room (R2R) and room-for-room (R4R) benchmarks, and the experimental results demonstrate that the proposed algorithm outperforms state-of-the-art methods.
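The abstract names A2C only at a high level; the snippet below is a minimal, hedged sketch of a standard A2C objective (discounted returns, advantage-weighted policy loss, and a critic value loss) as it is commonly applied to episodic navigation. All names here (`a2c_loss`, `value_coef`, `entropy_coef`) are illustrative assumptions and not taken from the paper.

```python
import torch
import torch.nn.functional as F


def a2c_loss(log_probs, values, rewards, entropies=None,
             gamma=0.99, value_coef=0.5, entropy_coef=0.01):
    """Simplified A2C loss for one navigation episode (illustrative sketch).

    log_probs: (T,) log-probabilities of the actions the agent took
    values:    (T,) critic estimates V(s_t)
    rewards:   (T,) per-step rewards (e.g., progress toward the goal)
    """
    T = rewards.shape[0]
    returns = torch.zeros(T)
    running = 0.0
    # Discounted return G_t, computed backwards over the episode.
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running
        returns[t] = running

    # Advantage A_t = G_t - V(s_t); detach so only the value loss trains the critic.
    advantages = returns - values.detach()
    policy_loss = -(log_probs * advantages).mean()
    value_loss = F.mse_loss(values, returns)

    loss = policy_loss + value_coef * value_loss
    if entropies is not None:
        # Optional entropy bonus to discourage premature action collapse.
        loss = loss - entropy_coef * entropies.mean()
    return loss
```

In practice, the policy and critic heads would share the history-aware cross-modal encoder, and the reward could combine goal distance reduction with a success bonus; those design details are assumptions here, not specifics stated in the abstract.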
