Visual robot navigation adaptive to directional cues using vision-language models and conditional diffusion models

Abstract

This paper presents a method for visual robot navigation guided by diverse directional cues, such as human pointing gestures and arrow signs. Existing visual navigation policies often focus solely on scene traversability or specific cue types, limiting their generality. To address this, we propose a unified guided visual navigation framework that interprets various directional cues using large Vision-Language Models (VLMs) and adapts its behavior via conditional diffusion models. The VLM recognizes the semantic meaning of directional cues and provides textual direction labels that condition a probabilistic navigation policy. Through experiments with real robots in office environments, the proposed system demonstrates the ability to follow both gesture and arrow cues adaptively within a single model. While some limitations remain in real-time cue recognition and collision-free operation, the results highlight the potential of combining VLMs and diffusion models for socially adaptive visual navigation.
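As a rough, non-authoritative sketch of the pipeline the abstract describes, the flow from a directional cue to a sampled action sequence might look as follows. The label set, prompt handling, embedding, and denoising network here are illustrative stand-ins, not the authors' implementation:

```python
import numpy as np

# Hypothetical label set; the paper's actual vocabulary is not specified here.
DIRECTION_LABELS = ["go left", "go right", "go straight", "stop"]

def query_vlm_for_direction(image: np.ndarray) -> str:
    """Placeholder for a large Vision-Language Model call that reads a
    pointing gesture or arrow sign and returns a textual direction label."""
    # A real system would send the image plus a prompt such as
    # "Which direction does the cue indicate?" to a VLM and parse the answer.
    return "go left"

def embed_label(label: str) -> np.ndarray:
    """Toy text embedding used as the conditioning vector."""
    one_hot = np.zeros(len(DIRECTION_LABELS))
    one_hot[DIRECTION_LABELS.index(label)] = 1.0
    return one_hot

def denoise_step(traj: np.ndarray, cond: np.ndarray, t: int) -> np.ndarray:
    """Stand-in for one reverse-diffusion step of the conditional policy.
    A trained network would predict the noise given (traj, cond, t)."""
    predicted_noise = 0.1 * traj  # placeholder prediction
    return traj - predicted_noise + 0.01 * np.random.randn(*traj.shape)

def sample_trajectory(cond: np.ndarray, horizon: int = 8, steps: int = 20) -> np.ndarray:
    """Sample a short (v, omega) action sequence by iterative denoising,
    conditioned on the direction-label embedding."""
    traj = np.random.randn(horizon, 2)          # start from Gaussian noise
    for t in reversed(range(steps)):
        traj = denoise_step(traj, cond, t)
    return traj

if __name__ == "__main__":
    image = np.zeros((480, 640, 3), dtype=np.uint8)  # current camera frame
    label = query_vlm_for_direction(image)
    actions = sample_trajectory(embed_label(label))
    print(label, actions.shape)
```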

Similar Papers
  • Conference Article
  • Cited by 6
  • 10.1109/acait56212.2022.10137974
Visual Navigation of Mobile Robots in Complex Environments Based on Distributed Deep Reinforcement Learning
  • Dec 9, 2022
  • Yi Zhang + 5 more

The increasingly popular method of deep reinforcement learning can not only help mobile robots output accurate actions in complex environments but can also search for collision-free paths. In this paper, a robot visual navigation model for complex environments based on distributed deep reinforcement learning is proposed. According to the characteristics of different regions in the complex environment, the environment is divided into several regions, and the proposed method can realize visual navigation in large-scale complex environments. In these regions, we combine long short-term memory (LSTM) and proximal policy optimization (PPO) algorithms as a local visual navigation model and design a new reward function that trains the agent using factors such as the action of the mobile robot, the distance between the robot and the target, and the running time of the robot. Each region creates its own experience pool independently during model training. The distributed deep reinforcement learning model uses the RGB-D image obtained from the first-person perspective of the mobile robot and the polar coordinates of the target in the robot's coordinate system as input, and the continuous motion of the mobile robot as output, to realize end-to-end visual navigation without maps. Our model can accurately complete robot visual navigation in large complex scenes without maps or human intervention. In our experiments, we verify the proposed model by performing navigation tasks in virtual environments.
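The reward described above combines the robot's action, its distance to the target, and its running time; a minimal sketch of such a shaped reward (the weights, terms, and goal bonus are assumptions, not the paper's exact formulation) could be:

```python
import numpy as np

def shaped_reward(position, target, prev_distance, action,
                  w_progress=1.0, w_action=0.05, w_time=0.01,
                  goal_radius=0.3, goal_bonus=10.0):
    """Illustrative reward combining progress toward the target, an action
    magnitude penalty, and a per-step running-time penalty."""
    distance = np.linalg.norm(np.asarray(target) - np.asarray(position))
    reward = w_progress * (prev_distance - distance)   # reward progress toward the goal
    reward -= w_action * np.linalg.norm(action)        # discourage erratic motion
    reward -= w_time                                    # penalize elapsed time each step
    if distance < goal_radius:
        reward += goal_bonus                            # terminal success bonus
    return reward, distance

# Example step: the robot moved from 3.2 m to 3.0 m away from the target.
r, d = shaped_reward(position=[1.0, 2.0], target=[4.0, 2.0],
                     prev_distance=3.2, action=[0.3, 0.1])
print(r, d)
```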

  • Conference Article
  • Cited by 7
  • 10.1109/icpr48806.2021.9413026
On Embodied Visual Navigation in Real Environments Through Habitat
  • Jan 10, 2021
  • Marco Rosano + 3 more

Visual navigation models based on deep learning can learn effective policies when trained on large amounts of visual observations through reinforcement learning. Unfortunately, collecting the required experience in the real world requires the deployment of a robotic platform, which is expensive and time-consuming. To deal with this limitation, several simulation platforms have been proposed in order to train visual navigation policies on virtual environments efficiently. Despite the advantages they offer, simulators present limited realism in terms of appearance and physical dynamics, leading to navigation policies that do not generalize to the real world. In this paper, we propose a tool based on the Habitat simulator which exploits real-world images of the environment, together with sensor and actuator noise models, to produce more realistic navigation episodes. We perform a range of experiments to assess the ability of such policies to generalize using virtual and real-world images, as well as observations transformed with unsupervised domain adaptation approaches. We also assess the impact of sensor and actuation noise on navigation performance and investigate whether it allows learning more robust navigation policies. We show that our tool can effectively help to train and evaluate navigation policies on real-world observations without running navigation episodes in the real world.
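The sensor and actuator noise models mentioned above can be approximated very roughly as follows; the Gaussian forms and parameter values are illustrative assumptions rather than the tool's actual models:

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_depth(depth, sigma=0.02, dropout=0.01):
    """Add multiplicative Gaussian noise and random dropout to a depth image."""
    noisy = depth * (1.0 + sigma * rng.standard_normal(depth.shape))
    mask = rng.random(depth.shape) < dropout
    noisy[mask] = 0.0          # simulate missing depth returns
    return noisy

def noisy_actuation(commanded_step, sigma_trans=0.02, sigma_rot=0.05):
    """Perturb a commanded (dx, dy, dtheta) step with actuation noise."""
    dx, dy, dtheta = commanded_step
    return (dx + sigma_trans * rng.standard_normal(),
            dy + sigma_trans * rng.standard_normal(),
            dtheta + sigma_rot * rng.standard_normal())

depth = 2.0 * np.ones((480, 640))        # a 2-metre wall, for illustration
obs = noisy_depth(depth)
step = noisy_actuation((0.25, 0.0, 0.0))
print(obs.mean(), step)
```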

  • Research Article
  • Cited by 167
  • 10.1016/j.neuroimage.2006.06.048
Automatic attention orienting by social and symbolic cues activates different neural networks: An fMRI study
  • Sep 1, 2006
  • NeuroImage
  • Jari K Hietanen + 4 more

  • Conference Article
  • 10.1109/iecon.2018.8592878
Robotic Obstacle Avoidance for Visual Navigation Based on Local Descriptors and Feasible Path
  • Oct 1, 2018
  • Dong Liu + 4 more

Aiming at obstacles in the robot's visual navigation path, a robot obstacle avoidance method based on local descriptors and a feasible path is proposed to achieve local path planning. The local grid of the robot is divided and the obstacle model is established. The risk function and feasible path of the robot are defined. For visual navigation, the navigation direction angle is obtained from the horizontal coordinates of matched local descriptor pairs. For obstacles in the robot's navigation path, a local obstacle avoidance control strategy is presented to find a feasible path around the obstacles. The method can choose an optimal path for obstacle avoidance and then return to the planned path through environmental interaction during visual navigation. Experimental results show that this method can effectively achieve the obstacle avoidance task during robot visual navigation.
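A minimal sketch of deriving a navigation direction angle from the horizontal coordinates of matched local descriptors, under a simple field-of-view assumption (the conversion and parameters are illustrative, not the paper's exact method):

```python
import numpy as np

def direction_angle_from_matches(matched_x_current, image_width, fov_deg=60.0):
    """Estimate a steering angle from the mean horizontal coordinate of
    descriptor matches: offsets left/right of the image center map to a
    turn angle under a simple field-of-view assumption."""
    mean_x = float(np.mean(matched_x_current))
    offset = (mean_x - image_width / 2.0) / (image_width / 2.0)  # in [-1, 1]
    return offset * (fov_deg / 2.0)   # degrees; positive = turn right

# Example: matches clustered on the right half of a 640-px-wide image.
print(direction_angle_from_matches([420, 450, 480], image_width=640))
```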

  • Research Article
  • Cited by 1
  • 10.1504/ijvas.2006.012209
Distributed visual navigation based on neural Q-learning for a mobile robot
  • Jan 1, 2006
  • International Journal of Vehicle Autonomous Systems
  • Guosheng Yang + 2 more

Distributed visual navigation based on neural Q-learning for a mobile robot is studied in this paper. First, a general distributed structure based on the multiple processors for visual navigation is established according to the decomposition of the mobile robot visual navigation task. Second, in terms of the general distributed structure, the local environment description method based on the Peer Group Filtering (PGF) and fuzzy technology is put forward. Third, in each local environment description, a controller based on neural Q-learning is designed to guide the mobile robot navigation. In the last part of this paper, experimental simulations are done to test the effectiveness of the presented distributed algorithm, including the image segmentation, environment description and navigation policy.
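For reference, the update that the controller builds on is the standard Q-learning rule; the paper approximates Q with a neural network, but a tabular sketch shows the form of the target:

```python
import numpy as np

def q_learning_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.95):
    """One tabular Q-learning update; the neural version replaces the table
    with a network approximator, but the TD target has the same form."""
    td_target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (td_target - Q[state, action])
    return Q

Q = np.zeros((10, 4))    # toy: 10 local-environment states, 4 navigation actions
Q = q_learning_update(Q, state=3, action=1, reward=1.0, next_state=4)
print(Q[3])
```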

  • Video Transcripts
  • 10.48448/59kf-w724
On Embodied Visual Navigation in Real Environment Through Habitat
  • Dec 29, 2020
  • Marco Rosano

Visual navigation models based on deep learning can learn effective policies when trained on large amounts of visual observations through reinforcement learning. Unfortunately, collecting the required experience in the real world requires the deployment of a robotic platform, which is expensive and time-consuming. To deal with this limitation, several simulation platforms have been proposed in order to train visual navigation policies on virtual environments efficiently. Despite the advantages they offer, simulators present a limited realism in terms of appearance and physical dynamics, leading to navigation policies that do not generalize in the real world. In this paper, we propose a tool based on the Habitat simulator which exploits real world images of the environment, together with sensor and actuator noise models, to produce more realistic navigation episodes. We perform a range of experiments to assess the ability of such policies to generalize using virtual and real-world images, as well as observations transformed with unsupervised domain adaptation approaches. We also assess the impact of sensor and actuation noise on the navigation performance and investigate whether it allows to learn more robust navigation policies. We show that our tool can effectively help to train and evaluate navigation policies on real-world observations without running navigation episodes in the real world.

  • Conference Article
  • 10.1109/robio49542.2019.8961726
Visual Topological Mapping and Navigation for Mobile Robot in Large-Scale Environment
  • Dec 1, 2019
  • Song Xu + 2 more

Autonomous navigation is a basic prerequisite for a mobile robot to explore its environment. Current navigation methods are mainly based on metric maps, which require precise geometric coordinates and lack the capability to efficiently store semantic information about the environment. In this paper, we present a visual topological mapping and navigation method for mobile robots in large-scale environments, which is similar to the human navigation system. The topological map represents the environment as a graph of nodes and edges in which the topological nodes record local semantic information, such as visual features, robot pose, and scene properties. In the topological navigation stage, an image-based Monte Carlo localization is proposed to estimate the semantic pose of the robot, which helps the robot judge more flexibly whether it has reached the target location. Experiments are conducted in real-world environments and the results indicate that the proposed system exhibits strong navigation robustness.
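A topological map of the kind described, with nodes holding local semantic information and edges encoding traversability, could be represented along these lines (the field names are illustrative assumptions):

```python
from dataclasses import dataclass, field

@dataclass
class TopoNode:
    """One topological node storing local semantic information."""
    node_id: int
    visual_features: list          # e.g., image descriptors observed here
    robot_pose: tuple              # (x, y, theta) when the node was created
    scene_label: str               # e.g., "corridor", "office door"

@dataclass
class TopoMap:
    nodes: dict = field(default_factory=dict)   # node_id -> TopoNode
    edges: set = field(default_factory=set)     # undirected (id_a, id_b) pairs

    def add_node(self, node: TopoNode):
        self.nodes[node.node_id] = node

    def connect(self, a: int, b: int):
        self.edges.add((min(a, b), max(a, b)))

m = TopoMap()
m.add_node(TopoNode(0, [], (0.0, 0.0, 0.0), "corridor"))
m.add_node(TopoNode(1, [], (3.0, 0.0, 0.0), "office door"))
m.connect(0, 1)
print(m.edges)
```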

  • Conference Article
  • Cited by 6
  • 10.2312/egve/egve04/053-060
Effect of visual cues on human performance in navigating through a virtual maze
  • Jun 8, 2004
  • Deepak Vembar + 5 more

Navigation in Virtual Environments (VEs) requires users to develop spatial knowledge of the environment primarily through visual cues provided to the user. Thus, the design and display of visual navigation cues is important for efficient navigation in a VE. In this paper, we report the results of an experiment in which three different visual cues were tested for their benefit to users' navigation in a 3D virtual maze. The experiment varied the form of the visual cue: a 2D map, a 2D map with a directionally ambiguous cue, and a 2D map with a directional cue. Eye tracking data were collected and analyzed to examine the correlation between the type of visual cue presented and the navigational efficiency of the user through the virtual maze. It was observed that the cue type affected the performance of participants in the 3D maze. The directional cue was the most effective in terms of the time taken by users to reach the center of the maze. Results of this study have implications for VE design as well as for game development.

  • Research Article
  • Cited by 53
  • 10.1007/s10846-020-01262-5
Incremental Learning for Autonomous Navigation of Mobile Robots based on Deep Reinforcement Learning
  • Dec 7, 2020
  • Journal of Intelligent & Robotic Systems
  • Manh Luong + 1 more

This paper presents an incremental learning method and system for autonomous robot navigation. A range-finder laser sensor and online deep reinforcement learning are utilized to generate the navigation policy, which is effective for avoiding obstacles along the robot's trajectories as well as for reaching the destination. Empirical experiments are conducted under simulation and real-world settings. Under the simulation environment, the results show that the proposed method can generate a highly effective navigation policy (more than 90% accuracy) after only 150k training iterations. Moreover, our system slightly outperformed deep-Q, while considerably surpassing Proximal Policy Optimization, two recent state-of-the-art robot navigation systems. Finally, two experiments are performed to demonstrate the feasibility and effectiveness of the proposed navigation system in real time under real-world settings.

  • Conference Article
  • Cited by 26
  • 10.1109/robot.2005.1570844
Robot Visual Navigation in Semi-structured Outdoor Environments
  • Jan 21, 2018
  • D Mateus + 2 more

This work describes a navigation framework for robots in semi-structured outdoor environments which enables planning of semantic tasks by chaining elementary vision-based movement primitives. Navigation is achieved by understanding the underlying world behind the image and using these results as a guideline to control the robot. As retrieving semantic information from vision is computationally demanding, short-term tasks are planned and executed while new vision information is processed. Thanks to learning techniques, the methods are adapted to different environmental conditions. Fusion and filtering techniques provide reliability and stability to the system. The procedures have been fully integrated and tested with a real robot in an experimental environment. Results are discussed.

  • Conference Article
  • Cited by 53
  • 10.1109/iros.2013.6696802
Teaching mobile robots to cooperatively navigate in populated environments
  • Nov 1, 2013
  • Markus Kuderer + 2 more

Mobile service robots are envisioned to operate in environments that are populated by humans and therefore ought to navigate in a socially compliant way. Since the desired behavior of the robots highly depends on the application, we need flexible means for teaching a robot a certain navigation policy. We present an approach that allows a mobile robot to learn how to navigate in the presence of humans while it is being teleoperated in its designated environment. Our method applies feature-based maximum entropy learning to derive a navigation policy from the interactions with the humans. The resulting policy maintains a probability distribution over the trajectories of all the agents that allows the robot to cooperatively avoid collisions with humans. In particular, our method reasons about multiple homotopy classes of the agents' trajectories, i.e., on which sides the agents pass each other. We implemented our approach on a real mobile robot and demonstrate that it is able to successfully navigate in an office environment in the presence of humans relying only on on-board sensors.
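Feature-based maximum entropy learning places a Boltzmann-type distribution over trajectories, P(τ) ∝ exp(−θᵀf(τ)); a toy version over a discrete set of candidate joint trajectories (the features and weights are assumptions) is:

```python
import numpy as np

def trajectory_distribution(feature_matrix, theta):
    """Maximum-entropy distribution over candidate trajectories:
    P(tau_i) is proportional to exp(-theta . f(tau_i))."""
    costs = feature_matrix @ theta
    logits = -costs
    logits -= logits.max()                 # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Toy example: 3 candidate joint trajectories with 2 features each
# (e.g., path length and minimum distance to the pedestrian).
features = np.array([[1.0, 0.2],
                     [1.2, 0.8],
                     [0.9, 0.1]])
theta = np.array([1.0, -2.0])              # illustrative learned weights
print(trajectory_distribution(features, theta))
```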

  • Conference Article
  • Cited by 14
  • 10.1109/icra40945.2020.9196602
Zero-shot Imitation Learning from Demonstrations for Legged Robot Visual Navigation
  • May 1, 2020
  • Xinlei Pan + 5 more

Imitation learning is a popular approach for training effective visual navigation policies. However, collecting expert demonstrations for legged robots is challenging as these robots can be hard to control, move slowly, and cannot operate continuously for long periods of time. In this work, we propose a zero-shot imitation learning framework for training a goal-driven visual navigation policy on a legged robot from human demonstrations (third-person perspective), allowing for high-quality navigation and cost-effective data collection. However, imitation learning from third-person demonstrations raises unique challenges. First, these demonstrations are captured from different camera perspectives, which we address via a feature disentanglement network (FDN) that extracts perspective-invariant state features. Second, as transition dynamics vary between systems, we reconstruct missing action labels by either building an inverse model of the robot's dynamics in the feature space and applying it to the human demonstrations or developing a Graphical User Interface (GUI) to label human demonstrations. To train a navigation policy we use a model-based imitation learning approach with FDN and action-labeled human demonstrations. We show that our framework can learn an effective policy for a legged robot, Laikago, from human demonstrations in both simulated and real-world environments. Our approach is zero-shot as the robot never navigates the same paths during training as those at testing time. We justify our framework by performing a comparative study.
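The action-relabeling idea, recovering missing action labels by applying an inverse model in the perspective-invariant feature space, can be sketched as below; the model, feature dimension, and action set are illustrative stand-ins for the learned components:

```python
import numpy as np

class ToyInverseModel:
    """Predicts the action connecting two consecutive state features.
    Stands in for the learned inverse-dynamics network described above."""
    def __init__(self, feat_dim, num_actions, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((2 * feat_dim, num_actions)) * 0.01

    def predict(self, feat_t, feat_t1):
        logits = np.concatenate([feat_t, feat_t1]) @ self.W
        return int(np.argmax(logits))      # discrete action label

def relabel_demonstration(features, inverse_model):
    """Assign an action label to every consecutive pair of demo features."""
    return [inverse_model.predict(features[i], features[i + 1])
            for i in range(len(features) - 1)]

demo_feats = [np.random.randn(16) for _ in range(5)]   # perspective-invariant features
print(relabel_demonstration(demo_feats, ToyInverseModel(16, num_actions=4)))
```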

  • Conference Article
  • Cited by 10
  • 10.1109/oceans.1984.1152232
Experience with Visual Robot Navigation
  • Jan 1, 1984
  • L Matthies + 1 more

The CMU Mobile Robot Lab is studying issues in the development of autonomous vehicles, including path planning, motion determination, and obstacle detection from video and sonar data. We have built a simple testbed vehicle and a visual navigation system designed to maneuver to a pre-defined location in a static environment. The visual system is based on algorithms developed by Moravec for the Stanford Cart [10]. At each Cart position, these algorithms used stereo correspondence in nine camera images to triangulate the distance to potential obstacles. Motion of the vehicle was determined by tracking these obstacles over time. This paper discusses several issues in the on-going evolution from the Cart to our present system. These issues have led to the use of fewer images per step, to the use of more constraint in the correspondence process, and toward the use of a different motion solving algorithm that better embodies the rigidity property inherent in the problem.
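The stereo triangulation underlying this Cart-style system can be illustrated for a rectified two-view case (Moravec's system used nine camera positions per step; the two-view formula and parameters here are a simplification):

```python
def triangulate_depth(disparity_px, focal_px=500.0, baseline_m=0.2):
    """Depth of a matched feature from a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a visible point")
    return focal_px * baseline_m / disparity_px

# A feature matched 25 px apart between the two views lies about 4 m away.
print(triangulate_depth(25.0))
```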

  • Research Article
  • Cited by 35
  • 10.1109/lra.2021.3068106
Visual Navigation in Real-World Indoor Environments Using End-to-End Deep Reinforcement Learning
  • Mar 2, 2021
  • IEEE Robotics and Automation Letters
  • Jonas Kulhanek + 2 more

Visual navigation is essential for many applications in robotics, from manipulation, through mobile robotics to automated driving. Deep reinforcement learning (DRL) provides an elegant map-free approach integrating image processing, localization, and planning in one module, which can be trained and therefore optimized for a given environment. However, to date, DRL-based visual navigation was validated exclusively in simulation, where the simulator provides information that is not available in the real world, e.g., the robot's position or image segmentation masks. This precludes the use of the learned policy on a real robot. Therefore, we propose a novel approach that enables a direct deployment of the trained policy on real robots. We have designed visual auxiliary tasks, a tailored reward scheme, and a new powerful simulator to facilitate domain randomization. The policy is fine-tuned on images collected from real-world environments. We have evaluated the method on a mobile robot in a real office environment. The training took ~30 hours on a single GPU. In 30 navigation experiments, the robot reached a 0.3-meter neighborhood of the goal in more than 86.7% of cases. This result makes the proposed method directly applicable to tasks like mobile manipulation.

  • Research Article
  • Cited by 7
  • 10.1109/tnnls.2023.3300888
Learning Heterogeneous Relation Graph and Value Regularization Policy for Visual Navigation.
  • Nov 1, 2024
  • IEEE Transactions on Neural Networks and Learning Systems
  • Kang Zhou + 3 more

The goal of visual navigation is steering an agent to find a given target object using its current observation. It is crucial to learn an informative visual representation and a robust navigation policy in this task. Aiming to improve these two parts, we propose three complementary techniques: a heterogeneous relation graph (HRG), a value-regularized navigation policy (VRP), and gradient-based meta-learning (ML). HRG integrates object relationships, including semantic closeness and spatial directions; e.g., a knife usually co-occurs with a bowl semantically or is located to the left of a fork spatially. It improves visual representation learning. Both VRP and gradient-based ML improve the robustness of the navigation policy, helping the agent escape from deadlock states such as being stuck or looping. Specifically, gradient-based ML is a supervision method used in policy-network training, which reduces the gap between the seen and unseen environment distributions. In this process, VRP maximizes the mutual information between visual observation and navigation policy, thus enabling more informed navigation decisions. Our framework shows superior performance over the current state of the art (SOTA) in terms of success rate and success weighted by path length (SPL). Our HRG outperforms the Visual Genome knowledge graph on cross-scene generalization with ≈56% and ≈39% improvement on Hits@5* (proportion of correct entities ranked in the top 5) and MRR* (mean reciprocal rank), respectively. Our code and HRG datasets will be made publicly available to the scientific community.
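The ranking metrics cited at the end, Hits@5* and MRR*, are standard measures; a generic computation (not the paper's starred variants) is:

```python
def hits_at_k(ranks, k=5):
    """Fraction of queries whose correct entity is ranked within the top k."""
    return sum(r <= k for r in ranks) / len(ranks)

def mean_reciprocal_rank(ranks):
    """Average of 1/rank of the correct entity over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)

ranks = [1, 3, 7, 2, 10]          # toy ranks of the correct entity per query
print(hits_at_k(ranks), mean_reciprocal_rank(ranks))
```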
