Trust region policy optimization (TRPO) is one of the landmark policy optimization algorithms in deep reinforcement learning. It maximizes a surrogate objective based on an advantage function, subject to a bound on the Kullback–Leibler (KL) divergence between two consecutive policies. Although the algorithm has been applied successfully many times in the literature, it has often been criticized because its strict divergence constraint can suppress exploration in some environments. Consequently, many researchers instead add an entropy regularization term to the expected discounted return or to the surrogate objective. Whether there is an alternative strategy for regularizing TRPO, however, remains an open question. In this paper, we present one. Our strategy is to regularize the KL divergence constraint itself with Shannon entropy. This relaxation enlarges the allowable difference between two consecutive policies and yields a new TRPO scheme with an entropy-regularized KL divergence constraint. The surrogate objective and the Shannon entropy are then approximated linearly, while the KL divergence is expanded quadratically, and an efficient conjugate gradient procedure solves the two resulting systems of linear equations; this yields a detailed code-level implementation that supports a fair experimental comparison. Extensive experiments on eight benchmark environments demonstrate that the proposed method outperforms both the original TRPO and TRPO with an entropy-regularized objective. Further, theoretical and experimental analysis shows that the three TRPO-like methods have the same time complexity and comparable computational cost.
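The abstract does not give the exact form of the regularized constraint; a plausible formalization, assuming the entropy term $H(\pi_\theta)$ relaxes the KL bound with a coefficient $\beta > 0$, is
\[
\max_{\theta}\; L_{\theta_{\text{old}}}(\theta)
\quad \text{s.t.} \quad
\bar{D}_{\mathrm{KL}}\!\left(\theta_{\text{old}}, \theta\right) - \beta\, H(\pi_\theta) \le \delta .
\]
Linearizing $L$ and $H$ around $\theta_{\text{old}}$ (absorbing the constant $H(\pi_{\theta_{\text{old}}})$ into $\delta$) and expanding the KL term to second order gives
\[
\max_{s}\; g^{\top} s
\quad \text{s.t.} \quad
\tfrac{1}{2}\, s^{\top} F s - \beta\, b^{\top} s \le \delta ,
\]
where $s = \theta - \theta_{\text{old}}$, $g = \nabla_\theta L$, $b = \nabla_\theta H$, and $F$ is the Fisher information matrix (the Hessian of the averaged KL divergence). Under this sketch, the KKT conditions involve $F^{-1} g$ and $F^{-1} b$, so conjugate gradient would be run on the two linear systems $F x = g$ and $F y = b$, which is consistent with the "two sets of linear equations" mentioned above; the coefficient $\beta$ and the sign convention of the entropy term are assumptions, not details taken from the abstract.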