Abstract

Model-free reinforcement learning methods have proven successful at learning complex tasks. Optimizing a policy directly from observations sampled from an environment avoids the accumulation of model errors that model-based methods suffer from. However, model-free methods are less sample-efficient than their model-based counterparts and may yield unstable policy updates when the step size between successive policy updates is too large. This survey analyzes and compares three state-of-the-art model-free policy search algorithms that address the latter issue of unstable policy updates: relative entropy policy search (REPS), trust region policy optimization (TRPO) and proximal policy optimization (PPO). All three algorithms constrain the policy update using the Kullback-Leibler (KL) divergence. After an introduction to model-free policy search methods, the importance of KL regularization for policy improvement is illustrated. Subsequently, the KL-regularized reinforcement learning problem is introduced and described. REPS, TRPO and PPO are then derived from a single set of equations, and their differences are detailed. The survey concludes with a discussion of the algorithms’ weaknesses, pointing out directions for future work.

