Abstract

In current off-policy actor–critic methods, the state–action pairs stored in the experience replay buffer (referred to as historical behaviors) are not directly used to improve the policy, and the target network and clipped double Q-learning techniques are needed to evaluate the policy. This framework limits policy-learning capability in complex environments and requires maintaining four critic networks. We therefore propose an efficient and lightweight off-policy actor–critic (EL-AC) framework. For policy improvement, we propose an efficient off-policy likelihood-ratio policy-gradient algorithm with historical-behavior reuse (PG-HBR), which enables the agent to learn an approximately optimal policy from historical behaviors. Moreover, a theoretically interpretable universal critic network is designed that approximates the action-value and state-value functions simultaneously, thereby providing the advantage function required by PG-HBR. For policy evaluation, we develop a low-pass filtering algorithm for target state-values and an adaptive overestimation-bias control algorithm, which together evaluate the policy efficiently and accurately using only one universal critic network. Extensive evaluations show that EL-AC outperforms state-of-the-art off-policy actor–critic methods in terms of approximately optimal policy learning and neural-network storage footprint, making it well suited to policy learning in complex environments.
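
To make the data flow concrete, the sketch below illustrates, under assumptions not stated in the abstract, how a single universal critic with a Q head and a V head could supply the advantage A(s, a) = Q(s, a) − V(s) that weights the likelihood-ratio gradient over replayed behaviors, and how a simple exponential-moving-average blend could stand in for the low-pass filtering of target state-values that replaces target networks. The names (UniversalCritic, filtered_target), the two-head architecture, and the filter coefficient beta are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn


class UniversalCritic(nn.Module):
    """Single critic approximating both Q(s, a) and V(s).

    Hypothetical sketch: a shared state trunk feeds two heads, one
    conditioned on the action (Q head) and one on the state alone (V head).
    The paper's exact architecture is not specified in the abstract.
    """

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.state_trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.q_head = nn.Sequential(
            nn.Linear(hidden + action_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )
        self.v_head = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor):
        h = self.state_trunk(state)
        q = self.q_head(torch.cat([h, action], dim=-1))
        v = self.v_head(h)
        return q, v


def advantage(critic: UniversalCritic, state, action):
    """Advantage A(s, a) = Q(s, a) - V(s), used to weight the
    likelihood-ratio policy gradient over replayed (historical) behaviors."""
    q, v = critic(state, action)
    return q - v


def filtered_target(reward, done, v_next, v_next_prev, beta=0.9, gamma=0.99):
    """Low-pass filtered bootstrap target (hypothetical form): blend the
    current estimate V(s') with the value recorded when the transition was
    last sampled, damping the fluctuations a target network would otherwise
    be needed to suppress. `beta` and `gamma` are assumed hyperparameters."""
    v_smooth = beta * v_next_prev + (1.0 - beta) * v_next
    return reward + gamma * (1.0 - done) * v_smooth
```

In this sketch, the single two-headed network stands in for the two critics and two target critics maintained by clipped double Q-learning, which is where the storage savings described above would come from.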
