Abstract

In multi-agent reinforcement learning, it is essential for agents to learn a communication protocol in order to optimize collaboration policies and to mitigate unstable learning. Existing methods based on actor-critic networks address communication among agents. However, these methods have difficulty improving sample efficiency and learning robust policies, because it is not easy to capture the dynamics and nonstationarity of the environment as the policies of other agents change. We propose a method for learning cooperative policies in multi-agent environments by considering the communication among agents. The proposed method consists of recurrent neural network-based actor-critic networks and deterministic policy gradients that centrally train decentralized policies. The actor networks enable the agents to communicate through forward and backward paths and to determine their subsequent actions. The critic network helps to train the actor networks by sending gradient signals to the actors according to their contributions to the global reward. To address partial observability and unstable learning, we propose auxiliary prediction networks that approximate the state transitions and the reward function. We used multi-agent environments to demonstrate the usefulness and superiority of the proposed method by comparing it with existing multi-agent reinforcement learning methods, in terms of both learning efficiency and goal achievement in the test phase. The results demonstrate that the proposed method outperformed the alternatives.
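To make the described architecture concrete, the following is a minimal PyTorch sketch of a recurrent actor in which agents exchange information through the forward and backward passes of a bidirectional GRU before emitting deterministic actions. The layer sizes, names, and the specific use of a GRU are illustrative assumptions, not the authors' exact network.

```python
# Hypothetical sketch (assumed details, not the paper's exact architecture):
# agents communicate via the forward/backward paths of a bidirectional GRU
# and then map the exchanged information to deterministic actions.
import torch
import torch.nn as nn

class RecurrentCommActor(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden_dim=64):
        super().__init__()
        self.encode = nn.Linear(obs_dim, hidden_dim)              # per-agent observation encoder
        self.comm = nn.GRU(hidden_dim, hidden_dim,
                           batch_first=True, bidirectional=True)  # forward/backward communication over agents
        self.policy = nn.Linear(2 * hidden_dim, act_dim)          # deterministic action head

    def forward(self, obs):
        # obs: (batch, n_agents, obs_dim) -> actions: (batch, n_agents, act_dim)
        h = torch.relu(self.encode(obs))
        h, _ = self.comm(h)               # each agent receives messages from both directions
        return torch.tanh(self.policy(h))

# Example: 3 agents, 10-dim observations, 2-dim continuous actions.
# actions = RecurrentCommActor(obs_dim=10, act_dim=2)(torch.randn(32, 3, 10))
```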

Highlights

  • Reinforcement learning algorithms have garnered attention with their ability to accomplish a wide variety of tasks, such as game playing [26], [33], complex continuous control tasks involving locomotion [22], and those in industrial applications [21]

  • To learn multiple policies capable of communication, we propose utilizing recurrent actor-critic networks trained by deterministic policy gradients (see the training-step sketch after this list)

  • We evaluate the proposed method in two sets of experiments and compare its performance with that of centralized training with decentralized execution algorithms based on the actor-critic method, such as multi-agent deep deterministic policy gradient (MADDPG) [24], multi-actor-attention critic (MAAC) [15], and bidirectionally-coordinated networks (BiCNets) [30]
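The sketch below illustrates one centralized-critic deterministic policy gradient step of the kind used to train such actors. The function and variable names (actors, critic, optimizer_i) are hypothetical, and the update is a generic centralized-training step rather than the paper's exact training loop.

```python
# Hedged sketch of a deterministic policy gradient update for agent i with a
# centralized critic; names and data layout are illustrative assumptions.
import torch

def actor_update(actors, critic, optimizer_i, obs_all, agent_index):
    """One policy gradient step for agent i using the centralized critic."""
    actions = [actor(obs).detach() for actor, obs in zip(actors, obs_all)]  # hold other agents fixed
    actions[agent_index] = actors[agent_index](obs_all[agent_index])        # differentiable path for agent i
    q = critic(torch.cat(obs_all, dim=-1), torch.cat(actions, dim=-1))      # critic scores the joint action
    loss = -q.mean()                  # ascend the critic's estimate of the global return
    optimizer_i.zero_grad()
    loss.backward()                   # the critic's gradient signal flows into actor i
    optimizer_i.step()
```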


Summary

INTRODUCTION

Reinforcement learning algorithms have garnered attention for their ability to accomplish a wide variety of tasks, such as game playing [26], [33], complex continuous control tasks involving locomotion [22], and tasks in industrial applications [21]. In the test phase, an agent receives information based only on its own observations and can execute an action without any input from other agents. This centralized training of decentralized policies has recently attracted attention from the multi-agent reinforcement learning community [13], [15], [24], [31]. To increase the robustness of learning and promote its full utilization, we propose an auxiliary prediction network for approximate model learning that can be readily adapted to model-free methods; this allows the agents to be well trained even in environments with partial observability. The auxiliary prediction network can be combined straightforwardly with model-free reinforcement learning methods, without any assumptions about the environment.
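As an illustration of how such an auxiliary prediction network could be attached to a model-free learner, here is a hedged sketch that predicts the next observation and the reward from the current observation and action, contributing an auxiliary loss alongside the actor-critic objective. The architecture, loss form, and weighting are assumptions rather than the paper's reported design.

```python
# Illustrative sketch (assumed details): an auxiliary head that approximates
# the state transition and reward function, trained jointly with the
# model-free actor-critic loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AuxiliaryPredictor(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden_dim=128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU())
        self.next_obs_head = nn.Linear(hidden_dim, obs_dim)  # approximates the state transition
        self.reward_head = nn.Linear(hidden_dim, 1)          # approximates the reward function

    def forward(self, obs, act):
        z = self.body(torch.cat([obs, act], dim=-1))
        return self.next_obs_head(z), self.reward_head(z)

def auxiliary_loss(predictor, obs, act, next_obs, reward):
    pred_obs, pred_rew = predictor(obs, act)
    return F.mse_loss(pred_obs, next_obs) + F.mse_loss(pred_rew.squeeze(-1), reward)

# total_loss = actor_critic_loss + aux_weight * auxiliary_loss(predictor, obs, act, next_obs, reward)
```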

BACKGROUND
DEEP Q-LEARNING
POLICY GRADIENTS
DEEP DETERMINISTIC POLICY GRADIENTS
PROPOSED METHOD
CONCLUSIONS