Abstract
Model-free reinforcement learning methods have proven successful in learning complex tasks. Optimizing a policy directly from observations sampled from the environment eliminates the accumulation of model errors that model-based methods suffer from. However, model-free methods are less sample efficient than their model-based counterparts and may yield unstable policy updates when the step between successive policies is too large. This survey analyzes and compares three state-of-the-art model-free policy search algorithms that address this issue of unstable policy updates: relative entropy policy search (REPS), trust region policy optimization (TRPO), and proximal policy optimization (PPO). All three algorithms constrain the policy update using the Kullback-Leibler (KL) divergence. After an introduction to model-free policy search methods, the importance of KL regularization for policy improvement is illustrated. Subsequently, the KL-regularized reinforcement learning problem is introduced and described. REPS, TRPO, and PPO are derived from a single set of equations, and their differences are detailed. The survey concludes with a discussion of the algorithms' weaknesses, pointing out directions for future work.
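As a brief sketch of the shared structure referred to above (not the survey's own derivation), the KL-constrained policy update underlying these methods can be written in TRPO-style notation, with old policy \pi_{\theta_{\mathrm{old}}}, an advantage estimate \hat{A}, and a step-size bound \delta (the symbols here are illustrative placeholders):

\max_{\theta} \;\; \mathbb{E}_{s,a \sim \pi_{\theta_{\mathrm{old}}}} \left[ \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\mathrm{old}}}(a \mid s)} \, \hat{A}(s,a) \right]
\quad \text{s.t.} \quad \mathbb{E}_{s} \left[ D_{\mathrm{KL}}\!\left( \pi_{\theta_{\mathrm{old}}}(\cdot \mid s) \,\big\|\, \pi_{\theta}(\cdot \mid s) \right) \right] \le \delta .

REPS imposes a related KL bound on the state-action distribution of successive policies, while PPO replaces the hard constraint with a clipped or KL-penalized surrogate objective.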