Abstract

Humans live among other humans, not in isolation. Therefore, the ability to learn and behave in multi-agent environments is essential for any autonomous system that intends to interact with people. Due to the presence of multiple simultaneous learners in a multi-agent learning environment, the Markov assumption used for single-agent environments is no longer tenable, necessitating the development of new Policy Learning algorithms. Recent Actor–Critic algorithms proposed for multi-agent environments, such as Multi-Agent Deep Deterministic Policy Gradients and Counterfactual Multi-Agent Policy Gradients, find a way to use the same mathematical framework as single-agent environments by augmenting the Critic with extra information. However, this extra information can slow down the learning process and afflict the Critic with the curse of dimensionality. To combat this, we propose a novel Deep Neural Network configuration called Deep Multi-Critic Network. This architecture works by taking a weighted sum over the outputs of multiple critic networks of varying complexity and size. The configuration was tested on data collected from a real-world multi-agent environment. The results show that the Deep Multi-Critic Network needs less data to reach the same level of performance as the unmodified baseline. This suggests that, because the configuration learns more from less data, the Critic may learn Q-values faster, accelerating Actor training as well.
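
To make the weighted-sum idea concrete, the sketch below combines several critic heads of varying size into a single Q-value estimate. It is a minimal PyTorch sketch under assumed details: the hidden-layer sizes, the softmax-normalised combination weights, and the class name are illustrative choices, not the authors' exact architecture.

import torch
import torch.nn as nn

class DeepMultiCriticNetwork(nn.Module):
    # Sketch of a multi-critic configuration: several critic heads of
    # varying complexity, combined by a weighted sum of their outputs.
    def __init__(self, input_dim, hidden_sizes=(32, 128, 512)):
        super().__init__()
        # Critic heads of increasing size (sizes are assumptions).
        self.critics = nn.ModuleList(
            nn.Sequential(nn.Linear(input_dim, h), nn.ReLU(), nn.Linear(h, 1))
            for h in hidden_sizes
        )
        # Learnable weights for combining the heads' Q-value estimates.
        self.weights = nn.Parameter(torch.ones(len(hidden_sizes)))

    def forward(self, state_action):
        # Each critic head produces its own Q-value estimate ...
        q = torch.cat([c(state_action) for c in self.critics], dim=-1)
        # ... and the network output is their weighted sum.
        w = torch.softmax(self.weights, dim=0)
        return (q * w).sum(dim=-1, keepdim=True)

# Example: a batch of 8 joint state-action vectors of dimension 24.
critic = DeepMultiCriticNetwork(input_dim=24)
q_value = critic(torch.randn(8, 24))  # shape (8, 1)

Normalising the combination weights with a softmax keeps the output on the same scale as a single critic; whether the paper's configuration normalises its weights this way is an assumption here.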

Highlights

  • The main purpose of Policy Learning is teaching a desired behaviour to an agent

  • Although Deep Multi-Critic Network contains more than 10 times as many trainable parameters as the Baseline model, its performance advantage is preserved even when only 0.05% of the data is used for training

  • The results suggest that Deep Multi-Critic Network uses available data more efficiently than the Baseline model


Introduction

The main purpose of Policy Learning is teaching a desired behaviour to an agent. The autonomous agent, either physical (a robot) or virtual (software), exists in an environment it can influence by acting intelligently. After the agent takes an action, the environment provides positive (or negative) feedback, and this feedback helps the agent learn how to act so as to maximise its reward; a minimal version of this interaction loop is sketched below. Beyond board games, Reinforcement Learning has been used to play video games better than humans (Mnih et al., 2015) and to teach a robot hand to mimic actions demonstrated by a human (Finn, Levine, & Abbeel, 2016).
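
The sketch below shows the agent-environment feedback loop described above, using the Gymnasium API and a random policy as placeholders; the environment name and the random action choice are illustrative assumptions, not part of the paper's setup.

import gymnasium as gym

env = gym.make("CartPole-v1")
state, info = env.reset()
total_reward = 0.0
done = False
while not done:
    # A learned policy would choose the action here; we sample randomly.
    action = env.action_space.sample()
    # The environment returns the next state and a reward signal,
    # which is the feedback the agent learns from.
    state, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated
env.close()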
