Abstract

Visually impaired individuals face significant challenges in independently navigating dynamic environments because they cannot fully perceive their surroundings or the actions of nearby people. Conventional navigation approaches such as Simultaneous Localization and Mapping (SLAM) rely on complete scanned maps and are limited to static, fixed environments. Vision Language Navigation (VLN) allows agents to understand semantic information and extend navigation to similar environments. However, neither approach can accurately navigate dynamic environments containing human actions. To address this challenge, we propose a novel cross-modal transformer-based Action-Aware VLN system (AAVLN). AAVLN's cross-modal transformer structure allows the agent to interpret natural language instructions and semantic visual information, enabling it to recognize human actions and navigate dynamic environments. The agent is trained with Reinforcement Learning in our action-based environment simulator, which we built by combining an existing simulator with our novel 3D human action generator. Experimental results demonstrate the effectiveness of our approach, which outperforms current methods on multiple metrics across challenging benchmarks. Ablation studies further show that the Vision Transformer-based human action recognition module and the cross-modal encoding each improve dynamic navigation accuracy. We are currently constructing 3D models of real-world environments, including hospitals and schools, to further train AAVLN, and we plan to integrate the system with ChatGPT to improve natural language interaction. AAVLN has numerous potential applications in robotics, AR, and other computer vision fields.
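To make the cross-modal design concrete, the sketch below shows one plausible way a VLN policy could fuse instruction tokens with scene features and human-action features from a ViT-based recognizer via cross-attention. The full paper is not reproduced here, so all module names, feature dimensions, and the choice of language-to-vision cross-attention are assumptions for illustration, not AAVLN's actual implementation; only the forward pass is shown, whereas the paper trains the policy with Reinforcement Learning in its action-based simulator.

```python
import torch
import torch.nn as nn


class CrossModalVLNAgent(nn.Module):
    """Illustrative sketch of an action-aware VLN policy: language tokens
    attend to visual tokens (scene + human-action features) through
    cross-attention, and the fused state is decoded into a navigation action.
    Dimensions and module choices are assumptions, not the paper's design."""

    def __init__(self, d_model=512, n_heads=8, n_layers=2,
                 vocab_size=10000, n_actions=6, vis_dim=768):
        super().__init__()
        self.instr_embed = nn.Embedding(vocab_size, d_model)   # instruction tokens
        self.scene_proj = nn.Linear(vis_dim, d_model)          # e.g. ViT scene patch features
        self.action_proj = nn.Linear(vis_dim, d_model)         # ViT human-action features
        self.cross_attn = nn.ModuleList(
            [nn.MultiheadAttention(d_model, n_heads, batch_first=True)
             for _ in range(n_layers)]
        )
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(n_layers)])
        self.policy_head = nn.Linear(d_model, n_actions)       # logits over discrete moves

    def forward(self, instr_ids, scene_feats, action_feats):
        # instr_ids:    (B, L)        tokenized instruction
        # scene_feats:  (B, Nv, 768)  visual tokens from the scene encoder
        # action_feats: (B, Na, 768)  tokens from the action-recognition module
        lang = self.instr_embed(instr_ids)
        vision = torch.cat(
            [self.scene_proj(scene_feats), self.action_proj(action_feats)], dim=1
        )
        x = lang
        for attn, norm in zip(self.cross_attn, self.norms):
            fused, _ = attn(query=x, key=vision, value=vision)  # language attends to vision
            x = norm(x + fused)                                 # residual + layer norm
        # Pool the fused language tokens into a single state and predict an action.
        return self.policy_head(x.mean(dim=1))


if __name__ == "__main__":
    # Minimal usage example with random tensors; shapes are illustrative only.
    agent = CrossModalVLNAgent()
    logits = agent(torch.randint(0, 10000, (2, 20)),
                   torch.randn(2, 196, 768),
                   torch.randn(2, 16, 768))
    print(logits.shape)  # torch.Size([2, 6])
```

In a reinforcement-learning setup, the action logits above would define the policy distribution sampled at each step in the simulator, with rewards based on navigation success; that training loop is omitted here.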
