Abstract

In nature, group behaviours such as flocking, as well as cross-species symbiotic partnerships, are observed in vastly different forms and circumstances. We hypothesize that such strategies can arise in response to generic predator-prey pressures in a spatial environment with range-limited sensation and action. We evaluate whether these forms of coordination can emerge through independent multi-agent reinforcement learning in simple multi-species ecosystems. In contrast to prior work, we avoid hand-crafted shaping rewards, specific actions, or dynamics that would directly encourage coordination across agents. Instead, we test whether coordination emerges as a consequence of adaptation when these specific forms of coordination are not encouraged and yield only indirect benefit. Our simulated ecosystems consist of a generic food chain involving three trophic levels: apex predator, mid-level predator, and prey. We conduct experiments on two different platforms: a 3D physics engine with tens of agents and a 2D grid world with up to thousands. The results clearly confirm our hypothesis and show substantial coordination both within and across species. To obtain these results, we leverage and adapt recent advances in deep reinforcement learning within an ecosystem training protocol featuring homogeneous groups of independent agents from different species (sets of policies), acting in many different random combinations in parallel habitats. The policies utilize neural network architectures that are invariant to agent individuality but not to type (species) and that generalize across varying numbers of observed other agents. While the emergence of complexity in artificial ecosystems has long been studied in the artificial life community, the focus has been more on individual complexity and on genetic algorithms or explicit modelling, and less on the group complexity and reinforcement learning emphasized in this article. Contrary to what the name and intuition suggest, reinforcement learning here adapts over evolutionary history rather than a lifetime and addresses the sequential optimization of fitness that is usually approached with genetic algorithms in the artificial life community. We utilize a shift from procedures to objectives, allowing us to bring powerful new machinery to bear, and we see complex behaviour emerge from a sequence of simple optimization problems.
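
The exact architecture is not specified in this summary; a minimal sketch of one common way to achieve invariance to agent individuality, while remaining type-aware and handling a varying number of observed agents, is to pool per-agent embeddings whose weights are shared within each species. The layer sizes, species names, and mean-pooling choice below are illustrative assumptions, not the authors' implementation.

    import numpy as np

    # Illustrative sketch (not the paper's architecture): a permutation-invariant,
    # type-aware observation encoder. Each observed agent is embedded with weights
    # shared within its species, and the embeddings are mean-pooled so the output
    # does not depend on the ordering or the number of observed agents.
    rng = np.random.default_rng(0)
    FEAT_DIM, EMBED_DIM = 4, 8                     # assumed sizes for illustration
    SPECIES = ("apex", "predator", "prey")

    # One embedding matrix per species: identity-invariant, but type (species) aware.
    W = {s: rng.standard_normal((FEAT_DIM, EMBED_DIM)) for s in SPECIES}

    def encode(observed):
        """observed: list of (species, feature_vector) for agents within sensing range."""
        pooled = np.zeros(EMBED_DIM)
        for species, feats in observed:
            pooled += np.tanh(feats @ W[species])  # per-type shared weights
        return pooled / max(len(observed), 1)      # mean-pool: order- and count-invariant

    # Example: two prey and one apex predator in range; reordering them gives the same code.
    obs = [("prey", rng.standard_normal(FEAT_DIM)),
           ("prey", rng.standard_normal(FEAT_DIM)),
           ("apex", rng.standard_normal(FEAT_DIM))]
    code = encode(obs)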

Highlights

  • Our natural world is the ultimate example of a self-organizing system (Ashby, 1947)

  • Experiences are gathered and sampled from separately for each policy, which is learned independently through updates to its network weights using state-of-the-art reinforcement learning (RL) algorithms: Maximum a-posteriori Policy Optimization (MPO) (Abdolmaleki et al., 2018) for the continuous case and IMPALA (Espeholt et al., 2018) for the discrete case

  • We introduce a physics-based food chain environment and describe its observations and rewards, as well as how agents process these and learn in an ecosystem training framework with three species

Introduction

Our natural world is the ultimate example of a self-organizing system (Ashby, 1947). Species and individuals adapt to each other in competition and cooperation, often as predators and prey in food chains. To investigate whether, at large scale and like Yang et al. (2018), we see population dynamics of a form that in some ways resembles nature (e.g. oscillations around a mean), we introduce a variation with spawning and vanishing (from predation) agents. This enables the population levels to reflect the success of the species (policy). While this kind of pattern was seen recently by Yang et al. (2018) in a grid world, they relied on an explicit "join group" action and introduced prey that explicitly require sufficiently large hunting groups. 3D physics and continuous control of forces provide a rich world that allows for more realistic and explicit behaviours, but can be more difficult to learn: even an apparently simple maneuver requires a long sequence of actions. Experiences are gathered and sampled from separately for each policy, which is learned independently through updates to its network weights using state-of-the-art RL algorithms: Maximum a-posteriori Policy Optimization (MPO) (Abdolmaleki et al., 2018) for the continuous case and IMPALA (Espeholt et al., 2018) for the discrete case.
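
The training protocol is described only at a high level above. The following schematic sketch shows one way such an ecosystem protocol can be organized: independent per-species policies act in random combinations across many parallel habitats, and each policy is then updated only from the experience its own agents generated. The class structure, function names, and constants are placeholders; the actual experiments use MPO for the continuous 3D case and IMPALA for the grid world rather than the generic update shown here.

    import random
    from collections import defaultdict

    SPECIES = ("apex", "predator", "prey")
    POLICIES_PER_SPECIES = 4                       # assumed for illustration

    def train(policies, run_episode, num_iterations, num_habitats):
        """policies: dict species -> list of policy objects exposing .update(batch).
        run_episode: simulates one habitat and returns {policy: list of transitions}."""
        for _ in range(num_iterations):
            buffers = defaultdict(list)
            # Act: sample a random combination of policies for each parallel habitat.
            for _ in range(num_habitats):
                combo = {s: random.choice(policies[s]) for s in SPECIES}
                for policy, transitions in run_episode(combo).items():
                    buffers[policy].extend(transitions)
            # Learn: each policy is updated independently from its own experience only
            # (MPO or IMPALA in the actual experiments; a placeholder update here).
            for policy, batch in buffers.items():
                policy.update(batch)

    # Minimal runnable usage with dummy placeholders.
    class DummyPolicy:
        def update(self, batch):
            pass                                   # a real learner would take a gradient step

    def dummy_episode(combo):
        return {p: [("obs", "action", 0.0)] for p in combo.values()}

    policies = {s: [DummyPolicy() for _ in range(POLICIES_PER_SPECIES)] for s in SPECIES}
    train(policies, dummy_episode, num_iterations=2, num_habitats=8)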

A Physics-Based Food Chain Environment
Experiments
Conclusions