Abstract

When deep reinforcement learning is applied to decision-making in real physical environments, improving sample efficiency while ensuring training stability is an urgent problem. To address it, several on-policy algorithms have been proposed and have achieved state-of-the-art performance. However, these on-policy algorithms, such as the proximal policy optimization (PPO) algorithm, suffer from extremely low sample efficiency. In this study, we propose a novel policy optimization method for robotic action control, named improved proximal policy optimization based on sample adaptive reuse and dual-clipping (SARD-PPO), which combines the training stability of on-policy methods with the sample efficiency of off-policy methods. First, we analyze the clipping mechanism of the PPO algorithm, devise a more constrained clipping mechanism based on the relationship between the clipping mechanism and the objective constraints, and develop a policy updating method that reuses samples from the prior policy in a more principled way. Second, we ensure training stability through element-level dual-clipping and through adaptive adjustment and reuse of the entire policy trajectory. Experimental results on six tasks in the MuJoCo benchmark show that SARD-PPO significantly improves policy performance while balancing training stability and sample efficiency, outperforming the baseline PPO algorithm and other state-of-the-art policy gradient methods that use on- and off-policy samples in terms of overall performance.
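To make the clipping mechanisms discussed above concrete, the following is a minimal sketch of the standard PPO clipped surrogate and a generic dual-clip variant (in the style of dual-clip PPO, where the objective is additionally bounded from below for negative advantages). This is an illustration of the general technique only, not the paper's SARD-PPO objective; the function names, the clipping parameter `eps`, and the dual-clip constant `c` are assumptions for the example.

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Standard per-element PPO clipped surrogate:
    min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped)

def dual_clip_objective(ratio, advantage, eps=0.2, c=3.0):
    """Generic dual-clip variant (illustrative, not the SARD-PPO objective):
    when A < 0 and the ratio is large, the standard min() is unbounded
    below, so an extra lower bound c * A (with c > 1) caps the loss."""
    standard = ppo_clip_objective(ratio, advantage, eps)
    return np.where(advantage < 0.0,
                    np.maximum(standard, c * advantage),
                    standard)
```

For example, with a stale sample whose ratio has drifted to 10 and whose advantage is -1, the standard surrogate evaluates to -10, whereas the dual-clipped version is bounded at c * A = -3, which is the stabilizing effect that motivates reusing older (more off-policy) samples safely.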
