Abstract

Studies of sequential decision-making in humans frequently find suboptimal performance relative to an ideal actor that has perfect knowledge of the model of how rewards and events are generated in the environment. Rather than being suboptimal, we argue that the learning problem humans face is more complex, in that it also involves learning the structure of reward generation in the environment. We formulate the problem of structure learning in sequential decision tasks using Bayesian reinforcement learning, and show that learning the generative model for rewards qualitatively changes the behavior of an optimal learning agent. To test whether people exhibit structure learning, we performed experiments involving a mixture of one-armed and two-armed bandit reward models, where structure learning produces many of the qualitative behaviors deemed suboptimal in previous studies. Our results demonstrate that humans can perform structure learning in a near-optimal manner.
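To make the structure-learning formulation concrete, the sketch below shows how a Bayesian learner can compare two candidate reward structures for a two-option task by their marginal likelihoods. This is a minimal illustration, assuming Bernoulli rewards with Beta(1, 1) priors and a uniform prior over structures; the structures compared here (a shared reward rate versus independent rates) and all names in the code are hypothetical, and are simpler than the one-armed/two-armed bandit mixture used in the experiments.

```python
import numpy as np
from scipy.special import betaln

# Hypothetical sketch of Bayesian structure learning for a two-option task.
#   S_shared: both options pay out with one common Bernoulli rate.
#   S_indep : each option has its own independent rate.
# Beta(1, 1) priors on all rates give closed-form marginal likelihoods.

def log_marginal(successes, failures):
    """log p(data | structure) for Bernoulli counts under a Beta(1, 1) prior."""
    return betaln(1 + successes, 1 + failures) - betaln(1, 1)

def structure_posterior(counts):
    """counts[i] = (successes_i, failures_i) for option i.
    Returns [P(S_shared | data), P(S_indep | data)] under a 50/50 prior."""
    s_tot = sum(s for s, _ in counts)
    f_tot = sum(f for _, f in counts)
    log_shared = log_marginal(s_tot, f_tot)                   # one rate for all data
    log_indep = sum(log_marginal(s, f) for s, f in counts)    # one rate per option
    logs = np.array([log_shared, log_indep])
    post = np.exp(logs - logs.max())                          # normalize in log space
    return post / post.sum()

# Example: option 0 pays often, option 1 rarely -> evidence favors S_indep.
print(structure_posterior([(9, 1), (2, 8)]))
```

As observations accumulate, the posterior over structures sharpens, and the choices of an agent that tracks this posterior change qualitatively with the structure it currently favors, which is the kind of effect the Bayesian reinforcement-learning formulation captures.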

Highlights

  • From a squirrel deciding where to bury its nuts to a scientist selecting the next experiment, all decision-making organisms must balance exploration of alternatives against exploitation of known options in developing action plans

  • In an experimental test of structure learning, we show that humans learn reward structure from experience in a near-optimal manner

  • We argue that structure learning plays a major role in human sequential decision-making

Introduction

From a squirrel deciding where to bury its nuts to a scientist selecting the next experiment, all decision-making organisms must balance exploration of alternatives against exploitation of known options in developing action plans. Determining when exploration is profitable is itself a decision problem that requires understanding, or learning about, the statistical structure of the environment. In the classic bandit setting, the aim is to maximize the total reward obtained from the environment, but the difficulty is that the rate of reward for each option is unknown and must be learned. Even in this simple setting, there may be several hypotheses about how the reward-generation process works, that is, how actions, observations, and unknowns are structurally "connected." We propose three kinds of structures that capture several versions of the sequential decision-making tasks available in the literature. The first structure has a temporal dependency between the present probability of reward and the past probability of reward, investigated in the context of
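To illustrate the exploration-exploitation trade-off described above, the following sketch implements Thompson sampling for a two-armed Bernoulli bandit: the agent maintains a Beta posterior over each arm's unknown reward rate, samples a plausible rate for each arm, and plays the arm whose sample is highest. This is a standard Bayesian heuristic offered only as an illustration, not the optimal Bayesian actor analyzed here; the reward rates and horizon in the code are arbitrary assumptions.

```python
import numpy as np

# Illustrative Thompson-sampling agent for a two-armed Bernoulli bandit.
# Uncertain arms yield sampled rates with high variance, so they still get
# tried occasionally (exploration); arms known to pay well usually produce
# the highest sample (exploitation).

rng = np.random.default_rng(seed=0)
true_rates = np.array([0.3, 0.7])   # unknown to the agent (assumed values)
alpha = np.ones(2)                  # Beta posterior: successes + 1, per arm
beta = np.ones(2)                   # Beta posterior: failures + 1, per arm

total_reward = 0
for t in range(1000):
    sampled = rng.beta(alpha, beta)         # one plausible rate per arm
    arm = int(np.argmax(sampled))           # act greedily on the samples
    reward = int(rng.random() < true_rates[arm])
    alpha[arm] += reward                    # Bayesian update of the chosen arm
    beta[arm] += 1 - reward
    total_reward += reward

print(f"total reward: {total_reward}")
print(f"posterior mean rates: {alpha / (alpha + beta)}")
```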
