Abstract

Proximal policy optimization (PPO) has yielded state-of-the-art results in policy search, a subfield of reinforcement learning; one of its key ideas is the use of a surrogate objective function to restrict the step size at each policy update. Although this restriction is helpful, the algorithm still suffers from performance instability and optimization inefficiency caused by the sudden flattening of the clipping curve. To address this issue we present a novel functional clipping policy optimization algorithm, named the Proximal Policy Optimization Smoothed Algorithm (PPOS), whose critical improvement is the use of a functional clipping method in place of the flat clipping method. We compare our approach with PPO and PPORB, which adopts a rollback clipping method, and prove that our approach can conduct more accurate updates than other PPO methods. We show that it outperforms the latest PPO variants in both performance and stability on challenging continuous control tasks. Moreover, we provide an instructive guideline for tuning the main hyperparameter of our algorithm.
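
To make the distinction concrete, below is a minimal sketch (not the paper's reference implementation) contrasting PPO's flat clipping with a smoothed, functional clipping of the probability ratio. The tanh-shaped ppos_smoothed_clip and its slope parameter alpha are illustrative assumptions, not the exact function proposed in the paper.

    import numpy as np

    def ppo_flat_clip(ratio, advantage, eps=0.2):
        # Standard PPO surrogate: the ratio is clipped to [1 - eps, 1 + eps],
        # so the objective becomes flat (zero gradient) once the ratio
        # leaves that range.
        clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
        return np.minimum(ratio * advantage, clipped * advantage)

    def ppos_smoothed_clip(ratio, advantage, eps=0.2, alpha=0.3):
        # Illustrative "functional" clipping (assumption): outside the
        # clipping range the ratio is replaced by a bounded tanh curve that
        # bends back toward the boundary, so the surrogate neither flattens
        # abruptly nor diverges.
        lower, upper = 1.0 - eps, 1.0 + eps
        above = upper - alpha * np.tanh(ratio - upper)  # bounded push-back from above
        below = lower - alpha * np.tanh(ratio - lower)  # bounded push-back from below
        shaped = np.where(ratio > upper, above,
                 np.where(ratio < lower, below, ratio))
        return np.minimum(ratio * advantage, shaped * advantage)

For a positive advantage and a ratio far above 1 + eps, ppo_flat_clip returns the constant (1 + eps) * advantage, whereas ppos_smoothed_clip decreases gently toward (1 + eps - alpha) * advantage, retaining a small restoring gradient instead of a flat plateau.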

Highlights

  • Reinforcement learning, especially deep model-free reinforcement learning, has achieved great progress in recent years

  • Inspired by the insights above, we propose a novel PPO clipping method, named the Proximal Policy Optimization Smoothed algorithm (PPOS), which combines the strengths of both PPO and Policy Optimization with Rollback (PPORB)

  • Trust-Region Optimization (TRO) methods are used to keep the new policy from straying far from the old policy, an idea first introduced in the relative entropy policy search (REPS) algorithm [17]


Summary

INTRODUCTION

Reinforcement learning, especially deep model-free reinforcement learning, has achieved great progress in recent years. Trust-region methods, however, are computationally inefficient and difficult to scale up to high-dimensional problems when extended to complex network architectures. To address this problem, Proximal Policy Optimization (PPO), which adopts a clipping mechanism on the likelihood ratio, was introduced [20]. PPORB adopts a straight downward-slope function instead of the original flat function when the ratio is outside the clipping range, which suppresses the residual incentive, left by PPO's flat clipping, to seek an overly large policy update. This solution introduces new problems: when the ratio becomes extremely large, the clipped ratio shoots toward positive or negative infinity, contradicting its original aim. We analyze the hyperparameter introduced here in relation to the dimensionality of five benchmark problems and provide a useful guideline for readers to choose it according to the dimension of their own problems.
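
As a minimal numeric illustration of the divergence mentioned above, the sketch below implements a flat PPO clip and a PPORB-style straight rollback slope with coefficient alpha; the parameter names and values are assumptions chosen only to show the trend, not the paper's settings.

    import numpy as np

    def flat_clip(ratio, eps=0.2):
        # PPO: constant outside [1 - eps, 1 + eps], so the value stays bounded.
        return np.clip(ratio, 1.0 - eps, 1.0 + eps)

    def rollback_clip(ratio, eps=0.2, alpha=0.3):
        # PPORB-style rollback (sketch): a straight line with negative slope
        # outside the clipping range. Because the slope never levels off, the
        # value keeps growing in magnitude as the ratio grows -- the
        # divergence discussed above.
        lower, upper = 1.0 - eps, 1.0 + eps
        return np.where(ratio > upper, upper - alpha * (ratio - upper),
               np.where(ratio < lower, lower + alpha * (lower - ratio), ratio))

    for r in (1.0, 1.5, 5.0, 50.0):
        print(f"ratio={r:6.1f}  flat={float(flat_clip(r)):.2f}  "
              f"rollback={float(rollback_clip(r)):.2f}")

With eps = 0.2 and alpha = 0.3, the flat clip stays at 1.20 for any ratio above 1.2, while the rollback value falls to about 0.06 at ratio 5 and about -13.4 at ratio 50, illustrating the unbounded behaviour that a bounded, smoothed clipping function is designed to avoid.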

PRELIMINARIES
PROXIMAL POLICY OPTIMIZATION
EXPERIMENTS
CHOICE OF THE HYPERPARAMETER
Findings
CONCLUSION