Abstract

Policy gradient algorithms for reinforcement learning (RL) have successfully tackled a broad range of high-dimensional continuous RL problems, including many challenging robotic control problems. These algorithms fall largely into two categories: on-policy and off-policy algorithms. Off-policy deep RL (DRL) algorithms are more sample efficient than on-policy algorithms and often outperform them. However, cutting-edge off-policy algorithms still suffer from low-quality policy gradient estimates, which compromise learning performance and make them highly sensitive to hyper-parameter settings. To address this issue, we propose a new concept of robust policy gradient (RPG). Driven by RPG, and inspired by the recent success of several ensemble DRL algorithms, this paper further develops a new policy ensemble gradient (PEG) algorithm for DRL. PEG efficiently and effectively estimates the RPG from multiple policy gradients, each obtained from one of several off-policy base learners in an ensemble. The estimated RPG is then used to train all base learners simultaneously. Comprehensive experiments were performed on six MuJoCo benchmark problems. Compared to four state-of-the-art off-policy algorithms and four cutting-edge ensemble policy gradient algorithms, our new PEG algorithm achieved highly competitive stability, performance, and sample efficiency. Further analysis shows that PEG is insensitive to varied hyper-parameter settings, confirming the positive role of RPG in building reliable and effective off-policy DRL algorithms.
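The abstract describes estimating a robust policy gradient from the per-learner gradients of an ensemble and then applying it to train all base learners. The following is a minimal sketch of that general idea, assuming all base learners share a common parameterization; the aggregation rule shown here (a coordinate-wise median) and the function names are illustrative assumptions, not the paper's actual RPG estimator.

```python
# Sketch of ensemble-based robust gradient aggregation (illustrative only).
# Each off-policy base learner i supplies its own policy gradient estimate g_i;
# the aggregated "robust" gradient is then applied to every learner.
import numpy as np

def robust_policy_gradient(gradients):
    """Combine per-learner gradient estimates into one robust estimate.

    gradients: list of 1-D arrays of equal length, one per base learner.
    The coordinate-wise median is an assumed stand-in for the paper's RPG.
    """
    stacked = np.stack(gradients, axis=0)   # shape: (n_learners, n_params)
    return np.median(stacked, axis=0)       # robust to outlier gradients

def update_all_learners(learner_params, gradients, lr=3e-4):
    """Gradient-ascent step: apply the shared robust gradient to every learner."""
    rpg = robust_policy_gradient(gradients)
    return [params + lr * rpg for params in learner_params]

# Hypothetical usage: 5 base learners, 10 policy parameters each.
rng = np.random.default_rng(0)
params = [rng.normal(size=10) for _ in range(5)]
grads = [rng.normal(size=10) for _ in range(5)]
params = update_all_learners(params, grads)
```

The median is used here only to illustrate why an ensemble of gradient estimates can yield a more robust update than any single, possibly noisy, off-policy estimate; the paper's concrete estimator and training loop are described in the full text.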
