Abstract

In visual navigation, a moving agent equipped with a camera is traditionally controlled by an input action, and the estimation of features from a sensory state (i.e. the camera view) is treated as a pre-processing step for high-level vision tasks. In this paper, we present a representation learning approach that instead considers both state and action as inputs. We condition the encoded feature of the state-transition network on the action that changes the view of the camera, thus describing the scene more effectively. Specifically, we introduce an action representation module that decodes an input action into a higher-dimensional representation to increase its representational power. We then fuse the output of the action representation module with the intermediate response of the state-transition network that predicts the future state. To enhance the discrimination capability among predictions from different input actions, we further introduce triplet ranking and $N$-tuplet loss functions, which in turn can be integrated with the regression loss. We demonstrate the proposed representation learning approach on reinforcement and imitation learning-based mapless navigation tasks, where the camera agent learns to navigate using only the camera view and the performed action, without external information.
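The pipeline the abstract describes, decoding a one-hot action into a higher-dimensional embedding and fusing it with the intermediate response of the state-transition network, can be sketched as follows. This is a minimal, hypothetical PyTorch sketch: the module names, layer sizes, and the use of concatenation for fusion are our illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the view-action representation idea.
# Layer sizes and names are assumptions for illustration only.
import torch
import torch.nn as nn

class ActionRepresentation(nn.Module):
    """Decodes a one-hot action code into a higher-dimensional embedding."""
    def __init__(self, num_actions: int, embed_dim: int = 128):
        super().__init__()
        self.decode = nn.Sequential(
            nn.Linear(num_actions, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim), nn.ReLU(),
        )

    def forward(self, one_hot_action: torch.Tensor) -> torch.Tensor:
        return self.decode(one_hot_action)

class StateTransition(nn.Module):
    """Predicts the next state feature from the current state feature,
    conditioned on the decoded action representation."""
    def __init__(self, state_dim: int = 256, embed_dim: int = 128):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(state_dim, embed_dim), nn.ReLU())
        # Fusion here is plain concatenation; the paper also studies variants.
        self.predict = nn.Sequential(
            nn.Linear(embed_dim + embed_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, state_dim),
        )

    def forward(self, state_feat: torch.Tensor,
                action_embed: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.encode(state_feat), action_embed], dim=-1)
        return self.predict(fused)
```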

Highlights

  • Visual navigation generates new input data through specific actions that change or select views, in order to perform, for example, object detection [1], visual categorisation [2], [3], or image enhancement [4]–[6]

  • We propose a view-action representation learning method that expands the dimensions of one-hot codes of input actions and fuses them with a state-transition network

  • In the context of mapless visual navigation, we present two loss functions, triplet ranking loss and N-tuplet loss, each of which can be combined with the regression loss for effective representation learning (see the sketch after this list)
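As a rough illustration of how the two losses can be combined with the regression loss, the sketch below implements a squared-distance triplet ranking term and its N-tuplet generalisation. This is hypothetical code: the margin, weighting, and choice of squared Euclidean distance are our assumptions, not values taken from the paper.

```python
# Hedged sketch of the joint loss functions (assumed hyperparameters).
import torch
import torch.nn.functional as F

def joint_triplet_loss(pred_taken, pred_other, target, margin=1.0, weight=1.0):
    """Regression loss on the prediction for the action actually taken,
    plus a triplet ranking term that keeps the prediction conditioned on
    a different action farther from the true next-state feature."""
    regression = F.mse_loss(pred_taken, target)
    d_pos = (pred_taken - target).pow(2).sum(dim=-1)   # distance, taken action
    d_neg = (pred_other - target).pow(2).sum(dim=-1)   # distance, other action
    ranking = F.relu(d_pos - d_neg + margin).mean()
    return regression + weight * ranking

def joint_ntuplet_loss(pred_taken, preds_other, target, margin=1.0, weight=1.0):
    """N-tuplet generalisation: one positive prediction (taken action)
    against N-1 negatives (predictions from all other actions)."""
    regression = F.mse_loss(pred_taken, target)
    d_pos = (pred_taken - target).pow(2).sum(dim=-1)               # (B,)
    d_neg = torch.stack(
        [(p - target).pow(2).sum(dim=-1) for p in preds_other])   # (N-1, B)
    ranking = F.relu(d_pos.unsqueeze(0) - d_neg + margin).mean()
    return regression + weight * ranking
```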

Summary

INTRODUCTION

Visual navigation generates new input data through specific actions that change or select views, in order to perform, for example, object detection [1], visual categorisation [2], [3], or image enhancement [4]–[6]. Deep neural networks can be used to learn to navigate with various strategies, such as feed-forward models [15], reinforcement learning (RL) [7], [16]–[20], or imitation learning (IL) [21], [22]. In each case, control heavily depends on the representation of the input data: the camera agent is trained to perform actions that lead it from the current camera view (state) to the final goal state. While learning methods to control the camera agent have been thoroughly investigated [24], there is limited understanding of how to design an efficient neural network-based architecture for representing state-action pairs. This paper substantially extends our previous work [20] with (i) joint regression and N-tuplet loss functions that generalise the joint regression and triplet ranking loss functions; (ii) variants of fusion approaches that combine the action representation module and the state-transition network; and (iii) qualitative and quantitative comparisons on RL- and IL-based mapless navigation tasks.
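Contribution (ii) mentions variants of the fusion between the action representation module and the state-transition network without detailing them in this summary. Two common choices for combining an action embedding with an intermediate state response are concatenation and element-wise (Hadamard) product, sketched below under the assumption that both inputs are plain feature vectors; these are illustrative variants, not necessarily the ones studied in the paper.

```python
# Two illustrative fusion variants (assumed for illustration only).
import torch

def fuse_concat(state_feat: torch.Tensor, action_embed: torch.Tensor):
    """Concatenation fusion: stack the two features along the last dimension."""
    return torch.cat([state_feat, action_embed], dim=-1)

def fuse_hadamard(state_feat: torch.Tensor, action_embed: torch.Tensor):
    """Multiplicative (Hadamard) fusion: requires matching dimensions."""
    return state_feat * action_embed
```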

Visual Navigation With and Without External Information
Representation Learning for Camera Control
Problem Description
Forward Model
Fusion
Inverse Model and Policy Network
REINFORCEMENT AND IMITATION LEARNING
Reinforcement Learning for Mapless Navigation
Imitation Learning for Mapless Navigation
VALIDATION
Reinforcement Learning
Imitation Learning
CONCLUSION
