Deep Deterministic Policy Gradient With Compatible Critic Network.

Di Wang, Mengqi Hu

doi:10.1109/tnnls.2021.3117790

Open Access

Journal: IEEE Transactions on Neural Networks and Learning Systems
Publication Date: Aug 1, 2023
Citations: 13
License type: publisher-specific, author manuscript

Affiliation: University of Illinois at Chicago

Abstract

Deep deterministic policy gradient (DDPG) is a powerful reinforcement learning algorithm for large-scale continuous control. DDPG back-propagates directly from the state-action value function to the actor network's parameters, which makes the compatibility of the critic network a major challenge. Compatibility here means that policy evaluation is consistent with policy improvement. As proved for the deterministic policy gradient, a compatible function guarantees convergence but tightly restricts the form of the critic network. The complexity and limitations of the compatible function have impeded its development in DDPG. This article introduces neural network similarity indices with gradients to measure compatibility concretely. We represent the actor network's and the critic network's training datasets, trained parameters, and gradients as kernel matrices. With the sketching trick, the computation time of the similarity index decreases substantially. Empirically, the centered kernel alignment index and the normalized Bures similarity index yield consistent compatibility scores. Moreover, we demonstrate the necessity of a compatible critic network in DDPG from three aspects: 1) analyzing the policy improvement/evaluation steps; 2) conducting a theoretical analysis; and 3) presenting experimental results. Building on this analysis, we remodel the compatible function with an energy function model, making it suitable for problems with large state-action spaces. Introducing policy change information into the critic-network optimization process gives the critic network higher compatibility scores and better performance. In addition, based on our experimental observations, we propose a computationally light solution to overestimation. To evaluate our algorithm's performance and validate the compatibility of the critic network, we compare our algorithm with six state-of-the-art algorithms on seven PyBullet robotics environments.
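
As a rough illustration of the centered kernel alignment index mentioned in the abstract, the sketch below computes the standard linear CKA between two sets of per-example features using NumPy. It is not the authors' implementation: the abstract's kernels also incorporate parameters and gradients and use a sketching trick, neither of which is reproduced here. The function names (`center_gram`, `linear_cka`) and the random stand-in feature matrices are assumptions made for this example only.

```python
import numpy as np


def center_gram(gram):
    """Double-center a Gram matrix: H K H with H = I - (1/n) * ones."""
    n = gram.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n
    return h @ gram @ h


def linear_cka(x, y):
    """Linear CKA between two feature matrices with one example per row."""
    k = center_gram(x @ x.T)  # centered linear-kernel Gram matrix of x
    l = center_gram(y @ y.T)  # centered linear-kernel Gram matrix of y
    # tr(K L) equals the elementwise sum because both matrices are symmetric.
    alignment = np.sum(k * l)
    # np.linalg.norm on a 2-D array defaults to the Frobenius norm.
    return alignment / (np.linalg.norm(k) * np.linalg.norm(l))


# Hypothetical stand-ins for actor and critic features on the same 64 inputs.
rng = np.random.default_rng(0)
actor_feats = rng.normal(size=(64, 8))
critic_feats = rng.normal(size=(64, 5))

# Linear CKA is invariant to orthogonal transformations of the features.
rotation, _ = np.linalg.qr(rng.normal(size=(8, 8)))
same_score = linear_cka(actor_feats, actor_feats @ rotation)
cross_score = linear_cka(actor_feats, critic_feats)
```

Here `same_score` should be 1 up to floating-point error, since a rotation leaves the linear Gram matrix unchanged, while `cross_score` for unrelated random features falls somewhere between 0 and 1.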

Full Text