Abstract

Among current reinforcement learning algorithms, the Policy Gradient (PG) algorithm [7] is one of the most traditional and widely used, but it suffers from unstable gradient estimation. The more recent Proximal Policy Optimization (PPO) algorithm [8] addresses this stability problem, but its policy updates are slow and it is prone to over-fitting when trained for too many iterations. In this article, a new method is proposed: drawing on Asynchronous Advantage Actor-Critic (A3C) [9], the basic PPO algorithm is trained in parallel, and a mechanism that accounts for future rewards is introduced, folding discounted future rewards into the current reward. Experiments on the OpenAI Gym platform with a robotic arm grasping objects at arbitrary positions show that our approach achieves faster training while also avoiding over-fitting during long-term training.
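One plausible reading of "folding future rewards into the current reward" is the standard discounted return, G_t = r_t + gamma * G_{t+1}. The sketch below illustrates that computation only; the function name, the discount factor value, and the example rewards are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Fold future rewards back into each step: G_t = r_t + gamma * G_{t+1}.

    `gamma` is an assumed discount factor; the paper does not specify one here.
    """
    returns = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

if __name__ == "__main__":
    # Hypothetical episode rewards, purely for illustration.
    print(discounted_returns([0.0, 0.0, 1.0], gamma=0.99))
```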
