Abstract

In multi-agent reinforcement learning, it is essential for agents to learn a communication protocol in order to optimize collaboration policies and to mitigate unstable learning. Existing methods based on actor-critic networks address communication among agents. However, these methods have difficulty improving sample efficiency and learning robust policies, because it is not easy to capture the dynamics and nonstationarity of the environment as the policies of other agents change. We propose a method for learning cooperative policies in multi-agent environments by considering the communication among agents. The proposed method consists of recurrent neural network-based actor-critic networks and deterministic policy gradients that centrally train decentralized policies. The actor networks enable the agents to communicate through forward and backward paths and to determine their subsequent actions. The critic network helps to train the actor networks by sending gradient signals to the actors according to their contributions to the global reward. To address partial observability and unstable learning, we propose auxiliary prediction networks that approximate the state transitions and the reward function. We used multi-agent environments to demonstrate the usefulness and superiority of the proposed method by comparing it with existing multi-agent reinforcement learning methods, in terms of both learning efficiency and goal achievement in the test phase. The results demonstrate that the proposed method outperformed the alternatives.
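To make the described architecture concrete, the following is a minimal PyTorch sketch of a recurrent actor in which agents exchange information through the forward and backward passes of a bidirectional GRU before emitting deterministic actions. The layer sizes, names, and the specific use of a GRU are illustrative assumptions, not the authors' exact network.

```python
# Hypothetical sketch (assumed details, not the paper's exact architecture):
# agents communicate via the forward/backward paths of a bidirectional GRU
# and then map the exchanged information to deterministic actions.
import torch
import torch.nn as nn

class RecurrentCommActor(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden_dim=64):
        super().__init__()
        self.encode = nn.Linear(obs_dim, hidden_dim)              # per-agent observation encoder
        self.comm = nn.GRU(hidden_dim, hidden_dim,
                           batch_first=True, bidirectional=True)  # forward/backward communication over agents
        self.policy = nn.Linear(2 * hidden_dim, act_dim)          # deterministic action head

    def forward(self, obs):
        # obs: (batch, n_agents, obs_dim) -> actions: (batch, n_agents, act_dim)
        h = torch.relu(self.encode(obs))
        h, _ = self.comm(h)               # each agent receives messages from both directions
        return torch.tanh(self.policy(h))

# Example: 3 agents, 10-dim observations, 2-dim continuous actions.
# actions = RecurrentCommActor(obs_dim=10, act_dim=2)(torch.randn(32, 3, 10))
```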

Highlights

  • Reinforcement learning algorithms have garnered attention with their ability to accomplish a wide variety of tasks, such as game playing [26], [33], complex continuous control tasks involving locomotion [22], and those in industrial applications [21]

  • To learn multiple policies capable of communication, we propose utilizing recurrent actor-critic networks trained by deterministic policy gradients (see the training-step sketch after this list)

  • We evaluate the proposed method in two sets of experiments and compare its performance with that of centralized training with decentralized execution algorithms based on the actor-critic method, such as multi-agent deep deterministic policy gradient (MADDPG) [24], multi-actor-attention critic (MAAC) [15], and bidirectionally-coordinated networks (BiCNets) [30]
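The sketch below illustrates one centralized-critic deterministic policy gradient step of the kind used to train such actors. The function and variable names (actors, critic, optimizer_i) are hypothetical, and the update is a generic centralized-training step rather than the paper's exact training loop.

```python
# Hedged sketch of a deterministic policy gradient update for agent i with a
# centralized critic; names and data layout are illustrative assumptions.
import torch

def actor_update(actors, critic, optimizer_i, obs_all, agent_index):
    """One policy gradient step for agent i using the centralized critic."""
    actions = [actor(obs).detach() for actor, obs in zip(actors, obs_all)]  # hold other agents fixed
    actions[agent_index] = actors[agent_index](obs_all[agent_index])        # differentiable path for agent i
    q = critic(torch.cat(obs_all, dim=-1), torch.cat(actions, dim=-1))      # critic scores the joint action
    loss = -q.mean()                  # ascend the critic's estimate of the global return
    optimizer_i.zero_grad()
    loss.backward()                   # the critic's gradient signal flows into actor i
    optimizer_i.step()
```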


Summary

INTRODUCTION

Reinforcement learning algorithms have garnered attention for their ability to accomplish a wide variety of tasks, such as game playing [26], [33], complex continuous control tasks involving locomotion [22], and tasks in industrial applications [21]. In the test phase, an agent receives information based only on its own observations and can execute an action without any input from other agents. This centralized training of decentralized policies has recently attracted attention from the multi-agent reinforcement learning community [13], [15], [24], [31]. To increase the robustness of learning and promote its full utilization, we propose an auxiliary prediction network for approximate model learning that can be readily adapted to model-free methods; this allows the agents to be well trained even in environments with partial observability. The auxiliary prediction network can be combined straightforwardly with model-free reinforcement learning methods, without any assumptions about the environment.
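As an illustration of how such an auxiliary prediction network could be attached to a model-free learner, here is a hedged sketch that predicts the next observation and the reward from the current observation and action, contributing an auxiliary loss alongside the actor-critic objective. The architecture, loss form, and weighting are assumptions rather than the paper's reported design.

```python
# Illustrative sketch (assumed details): an auxiliary head that approximates
# the state transition and reward function, trained jointly with the
# model-free actor-critic loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AuxiliaryPredictor(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden_dim=128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU())
        self.next_obs_head = nn.Linear(hidden_dim, obs_dim)  # approximates the state transition
        self.reward_head = nn.Linear(hidden_dim, 1)          # approximates the reward function

    def forward(self, obs, act):
        z = self.body(torch.cat([obs, act], dim=-1))
        return self.next_obs_head(z), self.reward_head(z)

def auxiliary_loss(predictor, obs, act, next_obs, reward):
    pred_obs, pred_rew = predictor(obs, act)
    return F.mse_loss(pred_obs, next_obs) + F.mse_loss(pred_rew.squeeze(-1), reward)

# total_loss = actor_critic_loss + aux_weight * auxiliary_loss(predictor, obs, act, next_obs, reward)
```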

BACKGROUND
DEEP Q-LEARNING
POLICY GRADIENTS
DEEP DETERMINISTIC POLICY GRADIENTS
PROPOSED METHOD
CONCLUSIONS