Abstract

In this paper, we present a new intrinsically motivated actor-critic algorithm for learning continuous motor skills directly from raw visual input. Our neural architecture is composed of a critic and an actor network. Both networks receive the hidden representation of a deep convolutional autoencoder that is trained to reconstruct the visual input, while the centre-most hidden representation is also optimized to estimate the state value. Separately, an ensemble of predictive world models generates, based on its learning progress, an intrinsic reward signal that is combined with the extrinsic reward to guide the exploration of the actor-critic learner. Our approach is more data-efficient and inherently more stable than existing actor-critic methods for continuous control from pixel data. We evaluate our algorithm on the task of learning robotic reaching and grasping skills in a realistic physics simulator and on a humanoid robot. The results show that the control policies learned with our approach achieve better performance than the compared state-of-the-art and baseline algorithms in both dense-reward and challenging sparse-reward settings.
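As a rough illustration of the architecture described above, the sketch below shows a convolutional autoencoder whose centre-most (bottleneck) representation is shared by an actor head and a critic head. It is a minimal, hypothetical PyTorch sketch, not the authors' implementation: the 64x64 RGB input resolution, layer sizes, and latent dimensionality are assumptions made only for illustration.

import torch
import torch.nn as nn

class AutoencoderActorCritic(nn.Module):
    # Hypothetical sketch: a convolutional autoencoder whose bottleneck feeds
    # both the actor and the critic; all sizes below are illustrative assumptions.
    def __init__(self, action_dim, latent_dim=64):
        super().__init__()
        # Encoder for assumed 64x64 RGB observations -> latent bottleneck
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=4, stride=2), nn.ReLU(),   # 64 -> 31
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),  # 31 -> 14
            nn.Flatten(),
            nn.Linear(32 * 14 * 14, latent_dim),                    # centre-most layer
        )
        # Decoder reconstructs the observation from the bottleneck
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 32 * 14 * 14), nn.ReLU(),
            nn.Unflatten(1, (32, 14, 14)),
            nn.ConvTranspose2d(32, 16, kernel_size=4, stride=2, output_padding=1), nn.ReLU(),  # 14 -> 31
            nn.ConvTranspose2d(16, 3, kernel_size=4, stride=2), nn.Sigmoid(),                  # 31 -> 64
        )
        # Actor and critic both read the shared latent representation
        self.actor = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                   nn.Linear(128, action_dim), nn.Tanh())
        self.critic = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                    nn.Linear(128, 1))

    def forward(self, obs):
        z = self.encoder(obs)        # shared hidden representation
        recon = self.decoder(z)      # optimized with a reconstruction loss
        action = self.actor(z)       # continuous action in [-1, 1]
        value = self.critic(z)       # state-value estimate
        return recon, action, value, z

In the setting described by the abstract, the bottleneck z would be trained jointly on the reconstruction and value-estimation objectives, so the representation fed to the actor is shaped by both.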

Highlights

  • To improve sample efficiency in deep Reinforcement Learning (RL), different approaches have recently been proposed

  • An autonomous agent learning control skills from trial and error in an unknown environment with zero prior knowledge is faced with the challenging task of correctly …

  • In this paper, we present a new intrinsically motivated actor-critic algorithm for learning continuous motor skills directly from raw visual input


Summary

Introduction

To improve sample efficiency in deep RL, different approaches have recently been proposed. Schaul et al. pointed out that, in most deep RL methods, transitions are randomly drawn from a replay buffer of recent transitions whenever a learning update of the network weights is performed. Instead of this inefficient uniform sampling, they proposed Prioritized Experience Replay, where each transition in the buffer is assigned a sampling probability proportional to its temporal-difference error [2]. Another line of work uses the Successor Representation (SR): the agent learns an estimate of the expectation over the future state representations from a given state and action. This allows the state-action value function, which estimates the expected future reward, to be replaced by a function estimating only the immediate reward combined with the SR, thereby eliminating the need for the slow propagation of state-action values among visited states.
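To make the replay mechanism concrete, here is a small, self-contained sketch of proportional prioritized sampling in the spirit of Schaul et al. [2]; it is illustrative only, and the buffer capacity, the exponent alpha, and the constant eps are assumed values rather than ones taken from that work.

import numpy as np

class ProportionalReplayBuffer:
    # Toy prioritized replay: each transition is sampled with probability
    # proportional to (|TD error| + eps) ** alpha; all constants are assumptions.
    def __init__(self, capacity=10000, alpha=0.6, eps=1e-6):
        self.capacity, self.alpha, self.eps = capacity, alpha, eps
        self.transitions, self.priorities = [], []

    def add(self, transition, td_error):
        if len(self.transitions) >= self.capacity:   # drop the oldest transition
            self.transitions.pop(0)
            self.priorities.pop(0)
        self.transitions.append(transition)
        self.priorities.append((abs(td_error) + self.eps) ** self.alpha)

    def sample(self, batch_size):
        probs = np.asarray(self.priorities)
        probs = probs / probs.sum()
        idx = np.random.choice(len(self.transitions), size=batch_size, p=probs)
        return [self.transitions[i] for i in idx], idx

    def update_priorities(self, idx, td_errors):
        # Refresh priorities after the learner recomputes the TD errors
        for i, err in zip(idx, td_errors):
            self.priorities[i] = (abs(err) + self.eps) ** self.alpha

The successor-representation idea mentioned above can likewise be stated compactly: with a successor matrix M(s, s') giving the expected discounted future occupancy of s' when starting from s, the value decomposes as V(s) = sum over s' of M(s, s') r(s'), so that once M is learned only the immediate reward r has to be estimated.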


