Model based planners reflect on their model-free propensities.

Rani Moran, Raymond J Dolan, Mehdi Keramati

doi:10.1371/journal.pcbi.1008552

Open Access

Abstract

Dual-reinforcement learning theory proposes that behaviour is under the tutelage of a retrospective, value-caching, model-free (MF) system and a prospective-planning, model-based (MB) system. This architecture raises the question of how far a MB controller, when devising a plan, takes account of influences from its MF counterpart. We present evidence that a sophisticated, self-reflective MB planner anticipates the influences that its own MF proclivities exert on the execution of its planned future actions. Using a novel bandit task, wherein subjects were periodically allowed to design their environment, we show that reward assignments were constructed in a manner consistent with a MB system taking account of its MF propensities. Specifically, participants assigned higher rewards to bandits that were momentarily associated with stronger MF tendencies. Our findings have implications for a range of decision-making domains, including drug abuse, pre-commitment, and the tension between short- and long-term decision horizons in economics.
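
The logic of the reward-assignment result can be illustrated with a minimal sketch. It is not the authors' computational model: it assumes, purely for illustration, that later choices follow a softmax over cached MF values, and the names `softmax`, `expected_reward`, `best_assignment` and the inverse temperature `beta` are hypothetical. Under that assumption, a planner that anticipates its own MF tendencies maximises expected reward by placing the larger reward on the bandit it is more likely to choose.

```python
import math
from itertools import permutations


def softmax(values, beta=1.0):
    """Choice probabilities from cached values (beta is an assumed inverse temperature)."""
    m = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(beta * (v - m)) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def expected_reward(assignment, choice_probs):
    """Expected payoff when assignment[i] is the reward placed on bandit i."""
    return sum(p * r for p, r in zip(choice_probs, assignment))


def best_assignment(mf_values, rewards, beta=1.0):
    """Return the reward-to-bandit assignment that maximises expected reward,
    assuming later choices are driven by the MF values alone (an illustrative assumption)."""
    probs = softmax(mf_values, beta)
    return max(permutations(rewards), key=lambda a: expected_reward(a, probs))


if __name__ == "__main__":
    # Hypothetical MF values: the first bandit currently has the stronger MF tendency.
    print(best_assignment([0.8, 0.2], [1.0, 0.0]))
```

In this toy setting the larger reward always lands on the bandit with the higher MF value, which mirrors the qualitative pattern reported in the abstract; a fuller account would also include the MB contribution to choice.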

Highlights

We focus on a common situation wherein a goal-directed Reinforcement Learning (RL) agent can choose or design an environment within which it will later seek rewards.
Previous evidence showing that biological agents rely on dual MF-MB systems [7,20] raises questions as to the nature and extent of the system interactions that govern overt behaviour.
An extensive RL literature suggests these interactions are governed by diverse processes, including a speed-accuracy trade-off [29], a trainer-actor dichotomy [8,30], MF reinforcement of MB goals [13], reliability-based arbitration [31] and retrospective MB inference guiding MF credit assignment [14].

Summary

Introduction

Objectives

Methods

Results

Discussion

Conclusion