Abstract

One challenge in applying reinforcement learning with nonlinear function approximators to high-dimensional continuous control problems is that the policy updates produced by many existing algorithms may fail to improve performance, or may even degrade it severely. To address this challenge, this paper proposes a new lower bound on policy improvement that penalizes the average policy divergence over the state space; to the best of our knowledge, this is currently the strongest available result on lower bounds for policy improvement. Optimizing this lower bound directly is difficult because it incurs a high computational overhead. Following the idea of trust region policy optimization (TRPO), the paper therefore also presents a monotonic policy optimization algorithm based on the new lower bound; it generates a sequence of monotonically improving policies and is suitable for large-scale continuous control problems. The proposed algorithms are evaluated against several existing algorithms on highly challenging robot locomotion tasks.
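
For background, the classical trust-region-style lower bound of Schulman et al. (2015), which the result summarized above aims to refine, has the form (shown here only as context; it is not the new bound derived in this paper):

\eta(\tilde{\pi}) \;\ge\; L_{\pi}(\tilde{\pi}) \;-\; \frac{4\epsilon\gamma}{(1-\gamma)^{2}}\,\max_{s} D_{\mathrm{KL}}\big(\pi(\cdot\mid s)\,\|\,\tilde{\pi}(\cdot\mid s)\big),

where \eta(\tilde{\pi}) is the expected discounted return of the new policy \tilde{\pi}, L_{\pi}(\tilde{\pi}) is the surrogate objective built from the current policy \pi, \gamma is the discount factor, and \epsilon bounds the magnitude of the advantage function. According to the abstract, the proposed bound instead penalizes an average policy divergence over the state space rather than the worst-case divergence, and the accompanying TRPO-style algorithm is introduced because optimizing that bound directly is computationally expensive.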
