Abstract

Behavioral evidence suggests that instrumental conditioning is governed by two forms of action control: a goal-directed and a habit learning process. Model-based reinforcement learning (RL) has been argued to underlie the goal-directed process; however, the way in which it interacts with habits and the structure of the habitual process have remained unclear. According to a flat architecture, the habitual process corresponds to model-free RL, and its interaction with the goal-directed process is coordinated by an external arbitration mechanism. Alternatively, the interaction between these systems has recently been argued to be hierarchical, such that the formation of action sequences underlies habit learning and a goal-directed process selects between goal-directed actions and habitual sequences of actions to reach the goal. Here we used a two-stage decision-making task to test predictions from these accounts. The hierarchical account predicts that, because the two actions are tied to each other as an action sequence, selecting a habitual action at the first stage will be followed by a habitual action at the second stage, whereas the flat account predicts that the statuses of the first- and second-stage actions are independent of each other. We found, based on subjects' choices and reaction times, that human subjects combined single actions to build action sequences and that the formation of such action sequences was sufficient to explain habitual actions. Furthermore, based on Bayesian model comparison, a family of hierarchical RL models, assuming a hierarchical interaction between habit and goal-directed processes, provided a better fit to the subjects' behavior than a family of flat models. Although these findings do not rule out all possible model-free accounts of instrumental conditioning, they do show that such accounts are not necessary to explain habitual actions, and they provide a new basis for understanding how goal-directed and habitual action control interact.

Highlights

  • There is considerable evidence from studies of instrumental conditioning in rats and humans that the performance of reward-related actions reflects the involvement of two learning processes, one controlling the acquisition of goal-directed actions and the other of habits [1,2,3,4].

  • Although habits are usually described as single-step actions, their tendency to combine or chunk with other actions [9,10,11,12,13,14,15], and their insensitivity to changes in the value of, and the causal relationship to, their consequences [2,16], suggest that they may best be viewed as action sequences [8].

  • The evaluation of action sequences is divorced from offline environmental changes in individual action-outcome contingencies, or in the value of outcomes, inside the sequence boundaries; as action sequences are no longer guided by a model of the environment [8], they are executed irrespective of the outcome of each individual action [12,17]; i.e., the actions run off in an order predetermined by the sequence, without requiring immediate feedback.


Introduction

There is considerable evidence from studies of instrumental conditioning in rats and humans that the performance of reward-related actions reflects the involvement of two learning processes, one controlling the acquisition of goal-directed actions and the other of habits [1,2,3,4]. This evidence suggests that goal-directed decision-making involves deliberating over the consequences of alternative actions in order to predict their outcomes, after which action selection is guided by the value of the predicted outcome of each action. Habitual actions reflect the tendency of individuals to repeat behaviors that have led to desirable outcomes in the past, and respect neither their causal relationship to, nor the value of, their consequences. As such, they are not guided by a model of the environment and are relatively inflexible in the face of environmental changes [5,6,7]. The evaluation of action sequences is divorced from offline environmental changes in individual action-outcome contingencies, or in the value of outcomes, inside the sequence boundaries; as action sequences are no longer guided by a model of the environment [8], they are executed irrespective of the outcome of each individual action [12,17]; i.e., the actions run off in an order predetermined by the sequence, without requiring immediate feedback.
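As a point of contrast with the model-guided deliberation described above, the "flat" habitual process can be illustrated with a minimal model-free learner on a simplified two-stage task. The sketch below is not the model fitted in the study; the learner, its parameters (`alpha`, `beta`), and the transition and payoff probabilities are all hypothetical, chosen only to make concrete how such a learner caches values from experienced rewards without consulting a model of the environment:

```python
import math
import random

def softmax_choice(qvals, beta, rng):
    """Sample an action index with probabilities proportional to exp(beta * Q)."""
    prefs = [math.exp(beta * q) for q in qvals]
    r = rng.random() * sum(prefs)
    acc = 0.0
    for i, p in enumerate(prefs):
        acc += p
        if r <= acc:
            return i
    return len(prefs) - 1

def train_flat_agent(n_trials=5000, alpha=0.1, beta=3.0, seed=1):
    """Flat, model-free (SARSA-style) learner on a toy two-stage task.

    State 0 is the first stage; states 1 and 2 are the second-stage
    states, each offering two actions with fixed (hypothetical) reward
    probabilities.
    """
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(3)]            # Q[state][action]
    reward_prob = {1: [0.9, 0.2], 2: [0.2, 0.1]}  # hypothetical payoffs
    for _ in range(n_trials):
        a1 = softmax_choice(Q[0], beta, rng)
        # Common (70%) vs. rare (30%) transitions, as in two-stage designs.
        common = rng.random() < 0.7
        s2 = (1 if a1 == 0 else 2) if common else (2 if a1 == 0 else 1)
        a2 = softmax_choice(Q[s2], beta, rng)
        reward = 1.0 if rng.random() < reward_prob[s2][a2] else 0.0
        # Model-free updates: values are cached from experienced rewards;
        # no transition model of the environment is ever used.
        Q[s2][a2] += alpha * (reward - Q[s2][a2])
        Q[0][a1] += alpha * (Q[s2][a2] - Q[0][a1])
    return Q
```

Because this learner only caches values, it cannot immediately revise its first-stage preference after an offline change to second-stage payoffs; it must re-experience the new contingencies trial by trial. This is exactly the kind of inflexibility in the face of environmental change that the text above attributes to habitual control.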
