Abstract

How do humans search for rewards? This question is commonly studied using multi-armed bandit tasks, which require participants to trade off exploration and exploitation. Standard multi-armed bandits assume that each option has an independent reward distribution. However, learning about options independently is unrealistic, since in the real world options often share an underlying structure. We study a class of structured bandit tasks, which we use to probe how generalization guides exploration. In a structured multi-armed bandit, options have a correlation structure dictated by a latent function. We focus on bandits in which rewards are linear functions of an option’s spatial position. Across 5 experiments, we find evidence that participants utilize functional structure to guide their exploration, and also exhibit a learning-to-learn effect across rounds, becoming progressively faster at identifying the latent function. Our experiments rule out several heuristic explanations and show that the same findings obtain with non-linear functions. Comparing several models of learning and decision making, we find that the best model of human behavior in our tasks combines three computational mechanisms: (1) function learning, (2) clustering of reward distributions across rounds, and (3) uncertainty-guided exploration. Our results suggest that human reinforcement learning can utilize latent structure in sophisticated ways to improve efficiency.
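To make the task structure described above concrete, here is a minimal sketch of one round of a structured bandit in which expected rewards are a linear function of each option's spatial position. The number of arms, slope, intercept, and noise level are illustrative assumptions, not the parameters used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

n_arms = 8                     # hypothetical number of options
positions = np.arange(n_arms)  # each option's spatial position

# Latent linear function: slope, intercept, and noise level are
# illustrative choices, not the values used in the experiments.
slope, intercept, noise_sd = 2.0, 5.0, 1.0
expected_reward = intercept + slope * positions

def pull(arm):
    """Sample a noisy reward for one option from the latent function."""
    return expected_reward[arm] + rng.normal(0.0, noise_sd)

# Because all options share the latent linear structure, observing a
# few of them is informative about the rest.
print([round(pull(a), 2) for a in (0, 3, 7)])
```

Under this structure, a learner who generalizes across positions can identify the best option after sampling only a handful of arms, which is the efficiency gain the experiments probe.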

Highlights

  • Imagine walking into a nearby supermarket to buy groceries for tonight’s dinner

  • Our results show that participants learn this structure and become faster at detecting it over rounds

  • Over the suite of experiments, the best overall model was the hybrid model that combines the predictions of Gaussian Process regression using a radial basis function kernel (GP-RBF) with the predictions generated by the Clustering model (Fig. 18; see Table F1 in Appendix F for detailed results)
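To make the GP-RBF component of the hybrid model concrete, the sketch below implements standard Gaussian Process regression with an RBF kernel over arm positions: given rewards observed at a few arms, it returns a posterior mean and uncertainty for every arm. The lengthscale, noise variance, and observations are hypothetical values for illustration, not the fitted parameters reported in the paper.

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0, signal_var=1.0):
    """RBF (squared-exponential) kernel: nearby positions get similar rewards."""
    d = x1[:, None] - x2[None, :]
    return signal_var * np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_posterior(x_obs, y_obs, x_query, lengthscale=1.0, noise_var=0.1):
    """Standard GP regression posterior mean and variance at query positions."""
    K = rbf_kernel(x_obs, x_obs, lengthscale) + noise_var * np.eye(len(x_obs))
    K_s = rbf_kernel(x_query, x_obs, lengthscale)
    K_ss = rbf_kernel(x_query, x_query, lengthscale)
    mean = K_s @ np.linalg.solve(K, y_obs)
    cov = K_ss - K_s @ np.linalg.solve(K, K_s.T)
    return mean, np.diag(cov)

# Hypothetical observations from three arms on a 1-D position axis.
x_obs = np.array([0.0, 3.0, 7.0])
y_obs = np.array([5.1, 11.2, 18.9])
x_query = np.arange(8.0)

mean, var = gp_posterior(x_obs, y_obs, x_query)
print(np.round(mean, 2))  # generalized reward estimates for all arms
print(np.round(var, 2))   # uncertainty is lowest near observed arms
```

The kernel encodes the generalization assumption: options at nearby spatial positions are expected to yield similar rewards, so a few observations constrain estimates for unobserved arms.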

Introduction

Imagine walking into a nearby supermarket to buy groceries for tonight's dinner. If you had enough time, you could try everything in the store multiple times to get a good sense of what you liked, and then repeatedly buy what you liked the most. This level of exhaustive exploration is unrealistic. Instead, you could explore in proportion to your uncertainty about the value of each item (Auer, Cesa-Bianchi, & Fischer, 2002; Gershman, 2018; Schulz & Gershman, 2019), choosing foods you know little about until you have accumulated enough knowledge to reliably select the ones you like. This paper studies how people combine generalization and uncertainty to guide exploration.
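One standard way to formalize "exploring in proportion to your uncertainty", in the spirit of the upper confidence bound (UCB) algorithms cited above, is to score each option by its estimated value plus a bonus proportional to the uncertainty of that estimate. The means, standard deviations, and exploration weight below are illustrative assumptions, not values fitted to participants.

```python
import numpy as np

def ucb_choice(mean, std, beta=1.0):
    """Pick the option maximizing estimated value plus an uncertainty bonus.

    beta controls the exploration-exploitation trade-off: larger beta
    favors poorly known options, while beta = 0 is pure exploitation.
    """
    return int(np.argmax(mean + beta * std))

# Hypothetical value estimates and uncertainties for five food items.
mean = np.array([4.0, 6.5, 5.0, 3.0, 6.0])  # how good each item seems
std = np.array([0.2, 0.3, 2.0, 1.5, 0.1])   # how uncertain each estimate is

print(ucb_choice(mean, std, beta=0.0))  # 1: exploit the current best guess
print(ucb_choice(mean, std, beta=2.0))  # 2: explore the highly uncertain item
```

Plugging the posterior mean and variance from a function-learning model such as GP-RBF into a rule like this yields the kind of uncertainty-guided, structure-aware exploration the paper's model comparison evaluates.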
