Abstract
Model-free reinforcement learning methods have proven successful in learning complex tasks. Optimizing a policy directly from observations sampled from the environment eliminates the accumulation of model errors that model-based methods suffer from. However, model-free methods are less sample efficient than their model-based counterparts and may yield unstable policy updates when the step between successive policies is too large. This survey analyzes and compares three state-of-the-art model-free policy search algorithms that address this issue of unstable policy updates: relative entropy policy search (REPS), trust region policy optimization (TRPO), and proximal policy optimization (PPO). All three algorithms constrain the policy update using the Kullback-Leibler (KL) divergence. After an introduction to model-free policy search methods, the importance of KL regularization for policy improvement is illustrated. Subsequently, the KL-regularized reinforcement learning problem is introduced and described. REPS, TRPO, and PPO are derived from a single set of equations, and their differences are detailed. The survey concludes with a discussion of the algorithms' weaknesses, pointing out directions for future work.
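As a brief sketch of the shared structure referred to above (not the survey's own derivation), the KL-constrained policy update underlying these methods can be written in TRPO-style notation, with old policy \pi_{\theta_{\mathrm{old}}}, an advantage estimate \hat{A}, and a step-size bound \delta (the symbols here are illustrative placeholders):

\max_{\theta} \;\; \mathbb{E}_{s,a \sim \pi_{\theta_{\mathrm{old}}}} \left[ \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\mathrm{old}}}(a \mid s)} \, \hat{A}(s,a) \right]
\quad \text{s.t.} \quad \mathbb{E}_{s} \left[ D_{\mathrm{KL}}\!\left( \pi_{\theta_{\mathrm{old}}}(\cdot \mid s) \,\big\|\, \pi_{\theta}(\cdot \mid s) \right) \right] \le \delta .

REPS imposes a related KL bound on the state-action distribution of successive policies, while PPO replaces the hard constraint with a clipped or KL-penalized surrogate objective.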