Abstract

Proximal policy optimization (PPO) is a deep reinforcement learning algorithm based on the actor–critic (AC) architecture. In the classic AC architecture, the Critic (value) network estimates the value function while the Actor (policy) network optimizes the policy according to that estimate. The efficiency of the classic AC architecture is limited because the policy does not directly participate in the value function update; the resulting value estimates can be inaccurate, which degrades the performance of the PPO algorithm. To address this, we design a novel AC architecture with policy feedback (AC-PF) that introduces the policy into the update process of the value function, and we further propose PPO with policy feedback (PPO-PF). For the AC-PF architecture, we derive a policy-based expected (PBE) value function and discounted reward formulas inspired by expected Sarsa. To make the value function more sensitive to policy changes and to improve the accuracy of the PBE value estimate in the early learning stage, we propose a policy update method based on a clipped discount factor. Moreover, we define the loss functions of the policy network and value network so that the policy update of PPO-PF is an unbiased estimate over the trust region. Experiments on Atari games and control tasks show that, compared to PPO, PPO-PF converges faster and achieves higher reward with smaller reward variance.
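For context, the expected Sarsa update from which the PBE value function draws its inspiration replaces the sampled next-action value with an expectation under the current policy; the standard textbook form is shown below (the PBE formulation in the paper builds on this idea, and its exact form may differ):

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \Big[ r_{t+1} + \gamma \sum_{a'} \pi(a' \mid s_{t+1})\, Q(s_{t+1}, a') - Q(s_t, a_t) \Big]$$

Because the target averages over the policy's action probabilities $\pi(a' \mid s_{t+1})$ rather than a single sampled action, the value update responds directly to changes in the policy, which is the coupling that AC-PF introduces into the Critic.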
