Abstract

In visual navigation, a moving agent equipped with a camera is traditionally controlled by an input action, and the estimation of features from a sensory state (i.e. the camera view) is treated as a pre-processing step for high-level vision tasks. In this paper, we present a representation learning approach that instead considers both state and action as inputs. We condition the encoded feature from the state-transition network on the action that changes the view of the camera, thus describing the scene more effectively. Specifically, we introduce an action representation module that generates decoded higher-dimensional representations from an input action to increase the representational power. We then fuse the output of the action representation module with the intermediate response of the state-transition network that predicts the future state. To enhance the discrimination capability among predictions from different input actions, we further introduce triplet ranking loss and $N$-tuplet loss functions, which in turn can be integrated with the regression loss. We demonstrate the proposed representation learning approach in reinforcement and imitation learning-based mapless navigation tasks, where the camera agent learns to navigate only through the view of the camera and the performed action, without external information.
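
The following is a minimal sketch (not the authors' released code) of how such an action representation module and its fusion with the state-transition network could be implemented; all module names, layer sizes, and the number of actions are illustrative assumptions.

```python
# Minimal PyTorch sketch: a one-hot action is expanded by an action representation
# module and fused with the intermediate response of a state-transition (forward)
# network that predicts the feature of the next camera view.
# All names and dimensions below are assumptions, not values from the paper.
import torch
import torch.nn as nn

class ActionRepresentation(nn.Module):
    """Decodes a one-hot action into a higher-dimensional representation."""
    def __init__(self, num_actions=4, action_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_actions, 64), nn.ReLU(),
            nn.Linear(64, action_dim), nn.ReLU(),
        )

    def forward(self, one_hot_action):
        return self.net(one_hot_action)

class ForwardModel(nn.Module):
    """Predicts the next-state feature from the current state feature and the action."""
    def __init__(self, state_dim=256, action_dim=128):
        super().__init__()
        self.action_rep = ActionRepresentation(action_dim=action_dim)
        self.state_enc = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU())
        # Fusion by concatenation; other fusion variants (e.g. element-wise product) are possible.
        self.predictor = nn.Sequential(
            nn.Linear(256 + action_dim, 256), nn.ReLU(),
            nn.Linear(256, state_dim),
        )

    def forward(self, state_feat, one_hot_action):
        a = self.action_rep(one_hot_action)      # expanded action code
        s = self.state_enc(state_feat)           # intermediate state response
        fused = torch.cat([s, a], dim=-1)        # fuse action with state response
        return self.predictor(fused)             # predicted next-state feature
```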

Highlights

  • Visual navigation generates new input data through specific actions that change or select views, in order to perform, for example, object detection [1], visual categorisation [2], [3], or image enhancement [4]–[6].

  • We proposed a view-action representation learning method that expands the dimensionality of one-hot codes of input actions and fuses them with a state-transition network.

  • In the context of mapless visual navigation, we presented two loss functions, triplet ranking loss and N-tuplet loss, each of which can be combined with the regression loss for effective representation learning (see the sketch after this list).
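
As a hedged illustration of these objectives, the snippet below combines a regression term with a triplet ranking term and shows one plausible N-tuplet generalisation that contrasts the true-action prediction against several wrong-action predictions; the margin, weighting, and distance choices are assumptions, not values from the paper.

```python
# Sketch of the joint objective: a regression term on the predicted next-state feature
# plus a ranking term that separates predictions made under different actions.
import torch
import torch.nn.functional as F

def joint_regression_triplet_loss(pred, target, pred_wrong_action, margin=1.0, lam=1.0):
    """pred: prediction with the true action; pred_wrong_action: prediction with a different action."""
    reg = F.mse_loss(pred, target)
    d_pos = F.pairwise_distance(pred, target)
    d_neg = F.pairwise_distance(pred_wrong_action, target)
    rank = F.relu(d_pos - d_neg + margin).mean()          # triplet ranking term
    return reg + lam * rank

def joint_regression_ntuplet_loss(pred, target, preds_wrong_actions, margin=1.0, lam=1.0):
    """N-tuplet variant: contrasts the true-action prediction against predictions
    obtained with all other (N-1) actions instead of a single negative."""
    reg = F.mse_loss(pred, target)
    d_pos = F.pairwise_distance(pred, target)
    rank = 0.0
    for neg in preds_wrong_actions:                       # one term per wrong-action prediction
        d_neg = F.pairwise_distance(neg, target)
        rank = rank + F.relu(d_pos - d_neg + margin).mean()
    rank = rank / max(len(preds_wrong_actions), 1)
    return reg + lam * rank
```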

Summary

INTRODUCTION

Visual navigation generates new input data through specific actions that change or select views, in order to perform, for example, object detection [1], visual categorisation [2], [3], or image enhancement [4]–[6]. Deep neural networks can be used to learn to navigate using various strategies, such as feed-forward models [15], reinforcement learning (RL) [7], [16]–[20], or imitation learning (IL) [21], [22]. In this case, the control heavily depends on the representations of the input data: the camera agent is trained to perform actions to reach, from the current camera view (state), the final goal state. While learning methods to control the camera agent have been thoroughly investigated [24], there is limited understanding of how to design an efficient neural network-based architecture for the representation of state-action pairs. This paper substantially extends our previous work [20] with (i) joint regression and N-tuplet loss functions that generalise the joint regression and triplet ranking loss functions; (ii) variants of fusion approaches that combine the action representation module and the state-transition network; and (iii) qualitative and quantitative comparisons under RL and IL-based mapless navigation tasks.
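
To show how the pieces above could fit together in training, here is a toy training step that assumes the ForwardModel and joint_regression_triplet_loss sketches from earlier are in scope; the batch size, feature dimension, action set, and hyper-parameters are made up for illustration.

```python
# Toy training step, assuming the ForwardModel and joint_regression_triplet_loss
# sketches above are in scope; all shapes and hyper-parameters are illustrative.
import torch
import torch.nn.functional as F

model = ForwardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

state_feat = torch.randn(8, 256)             # encoded current camera views (batch of 8)
next_feat = torch.randn(8, 256)              # encoded next views (regression targets)
a_idx = torch.randint(0, 4, (8,))            # actions actually taken
neg_idx = (a_idx + 1) % 4                    # guarantee a different action as negative
action = F.one_hot(a_idx, 4).float()
wrong_action = F.one_hot(neg_idx, 4).float()

pred = model(state_feat, action)             # prediction with the true action
pred_wrong = model(state_feat, wrong_action) # prediction with a wrong action
loss = joint_regression_triplet_loss(pred, next_feat, pred_wrong)

opt.zero_grad()
loss.backward()
opt.step()
```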

Visual Navigation With and Without External Information
Representation Learning for Camera Control
Problem Description
Forward Model
Fusion
Inverse Model and Policy Network
REINFORCEMENT AND IMITATION LEARNING
Reinforcement Learning for Mapless Navigation
Imitation Learning for Mapless Navigation
VALIDATION
Reinforcement Learning
Imitation Learning
CONCLUSION