Abstract

The joint action-value function (JAVF) plays a key role in centralized training for multi-agent deep reinforcement learning (MADRL) algorithms based on value function decomposition (VFD), and in generating a collaborative policy among agents. However, factors such as environmental noise, inadequate exploration, and the iterative update mechanism inevitably introduce estimation bias, causing the JAVF to be overestimated; this in turn prevents agents from receiving accurate reward signals during learning and from correctly approximating the optimal policy. To address this problem, this paper first analyzes the causes of JAVF overestimation, provides the corresponding mathematical proofs and theoretical derivations, and derives a lower bound on the overestimation error. It then proposes λWD QMIX, an MADRL overestimation-reduction method based on multi-step weighted double estimation. Specifically, λWD QMIX achieves more stable and accurate JAVF estimates through a bias-correction mechanism based on weighted double estimation together with multi-step updates based on eligibility-trace backups, without adding or modifying any network structure. Experiments on the StarCraft II micromanagement benchmark show that the proposed λWD QMIX algorithm effectively improves both the final performance and the learning efficiency of the baseline algorithm, and can be seamlessly integrated with some communication-based MADRL algorithms.
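To make the two mechanisms named above concrete, the sketch below illustrates one common way to blend a single (max) estimator with a double estimator into a bootstrapped target, and to fold such targets into a multi-step λ-return. It is a minimal illustration of the general idea under assumed names (`weighted_double_target`, `lambda_return`, weight `beta`) and an assumed fixed weighting scheme; it is not the paper's exact λWD QMIX update rule.

```python
import numpy as np


def weighted_double_target(q_eval, q_target, reward, gamma, beta):
    """One-step bootstrapped target blending single and double estimation.

    q_eval, q_target: next-state action values from the online and target
    networks, shape [num_actions].
    beta in [0, 1]: weight on the double estimator; larger beta suppresses
    the overestimation of the plain max target more strongly.
    (Illustrative weighting; the paper's weighting scheme may differ.)
    """
    a_star = np.argmax(q_eval)          # greedy action under the online network
    single = np.max(q_target)           # standard max target (biased high)
    double = q_target[a_star]           # double-Q style cross evaluation
    return reward + gamma * (beta * double + (1.0 - beta) * single)


def lambda_return(rewards, bootstrap_values, gamma, lam):
    """Backward-recursive λ-return over a trajectory (multi-step target).

    rewards[t] is the reward at step t; bootstrap_values[t] is the
    (weighted double) bootstrap estimate of the value at step t + 1.
    """
    g = bootstrap_values[-1]
    targets = np.zeros(len(rewards))
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * ((1.0 - lam) * bootstrap_values[t] + lam * g)
        targets[t] = g
    return targets


# Toy usage: one transition with 4 discrete joint actions, then a 3-step return.
rng = np.random.default_rng(0)
q_eval, q_target = rng.normal(size=4), rng.normal(size=4)
td_target = weighted_double_target(q_eval, q_target, reward=1.0, gamma=0.99, beta=0.5)
targets = lambda_return(rewards=[1.0, 0.0, 1.0],
                        bootstrap_values=[td_target, td_target, td_target],
                        gamma=0.99, lam=0.8)
```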
