In many real-world decision-making tasks, multiple agents need to learn to collaborate in high-dimensional, complex action spaces rather than in a single discrete action space. Recently, value decomposition learning methods such as QMIX have emerged as a promising approach to collaborative multi-agent tasks. However, most value decomposition algorithms are applicable only to discrete action spaces, which limits their practicality. To address this limitation, we propose a novel algorithm called Multi-Agent Sequential Q-Networks (MASQN), which can be applied to multi-agent domains with continuous, multidiscrete, or hybrid action spaces. The proposed algorithm is built on the centralized training with decentralized execution (CTDE) paradigm. The decentralized actors adapt to different action spaces by combining action space discretization with sequential models, and the centralized critic uses a value decomposition architecture to guide effective updates of each agent's policy parameters. We also establish the convergence of the joint policy from the perspective of policy iteration, combining the CTDE structure with the constraint imposed by the Individual-Global-Max (IGM) condition. Finally, we evaluate MASQN on two benchmark environments: MAMuJoCo and Hybrid Predator–Prey. The empirical results show that MASQN outperforms state-of-the-art methods on three different action spaces.
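For context, the Individual-Global-Max (IGM) condition invoked above is commonly stated as follows; the notation here is the standard one from the value decomposition literature and is not taken from this paper:

\[
\arg\max_{\mathbf{u}} Q_{\mathrm{tot}}(\boldsymbol{\tau}, \mathbf{u})
= \Big( \arg\max_{u^{1}} Q_{1}(\tau^{1}, u^{1}), \ldots, \arg\max_{u^{n}} Q_{n}(\tau^{n}, u^{n}) \Big),
\]

i.e., the joint greedy action under the centralized value function \(Q_{\mathrm{tot}}\) must coincide with the tuple of each agent's individually greedy action under its own utility \(Q_{i}\), which is what allows decentralized greedy execution to remain consistent with centralized training.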