Model-free reinforcement learning algorithms based on entropy regularization have achieved strong performance in control tasks. These algorithms add an entropy-regularization term to the objective so that the learned policy remains stochastic. This work provides a new perspective: it aims to explicitly learn a representation of the intrinsic information in state transitions in order to obtain a multimodal stochastic policy that better balances exploration and exploitation. We study a class of Markov decision processes (MDPs) with divergence maximization, called divergence MDPs. The goal of a divergence MDP is to find an optimal stochastic policy that maximizes the sum of the expected discounted total reward and a divergence term, where the divergence function captures the implicit information of the state transitions. This yields stochastic policies with improved robustness and performance in high-dimensional continuous settings. Under this framework, we derive the optimality equations and then develop a divergence actor-critic (DivAC) algorithm, based on divergence policy iteration, to address large-scale continuous problems. Experimental results show that, compared to other methods, our approach achieves better performance and robustness, particularly in complex environments. The code of DivAC is available at https://github.com/yzyvl/DivAC.
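As a rough sketch, and using notation assumed here rather than taken from the abstract, the divergence-regularized objective described above has the form

J(\pi) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} \Big( r(s_t, a_t) \;+\; \beta\, D(s_t, a_t, s_{t+1}) \Big)\right],

where \gamma is the discount factor, \beta > 0 is an assumed weighting coefficient on the divergence term, and D is the divergence function evaluated on the state transition; replacing D with the policy entropy \mathcal{H}(\pi(\cdot \mid s_t)) would recover the standard entropy-regularized objective. The exact form of D used by DivAC is specified in the full paper, not here.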