Abstract

How do humans search for rewards? This question is commonly studied using multi-armed bandit tasks, which require participants to trade off exploration and exploitation. Standard multi-armed bandits assume that each option has an independent reward distribution. However, learning about options independently is unrealistic, since in the real world options often share an underlying structure. We study a class of structured bandit tasks, which we use to probe how generalization guides exploration. In a structured multi-armed bandit, options have a correlation structure dictated by a latent function. We focus on bandits in which rewards are linear functions of an option’s spatial position. Across 5 experiments, we find evidence that participants utilize functional structure to guide their exploration, and also exhibit a learning-to-learn effect across rounds, becoming progressively faster at identifying the latent function. Our experiments rule out several heuristic explanations and show that the same findings obtain with non-linear functions. Comparing several models of learning and decision making, we find that the best model of human behavior in our tasks combines three computational mechanisms: (1) function learning, (2) clustering of reward distributions across rounds, and (3) uncertainty-guided exploration. Our results suggest that human reinforcement learning can utilize latent structure in sophisticated ways to improve efficiency.
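To make the task structure described above concrete, here is a minimal sketch of one round of a structured bandit in which expected rewards are a linear function of each option's spatial position. The number of arms, slope, intercept, and noise level are illustrative assumptions, not the parameters used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

n_arms = 8                     # hypothetical number of options
positions = np.arange(n_arms)  # each option's spatial position

# Latent linear function: slope, intercept, and noise level are
# illustrative choices, not the values used in the experiments.
slope, intercept, noise_sd = 2.0, 5.0, 1.0
expected_reward = intercept + slope * positions

def pull(arm):
    """Sample a noisy reward for one option from the latent function."""
    return expected_reward[arm] + rng.normal(0.0, noise_sd)

# Because all options share the latent linear structure, observing a
# few of them is informative about the rest.
print([round(pull(a), 2) for a in (0, 3, 7)])
```

Under this structure, a learner who generalizes across positions can identify the best option after sampling only a handful of arms, which is the efficiency gain the experiments probe.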

Highlights

  • Imagine walking into a nearby supermarket to buy groceries for tonight’s dinner

  • Our results show that participants learn this structure and become faster at detecting it over rounds

  • Over the suite of experiments, the best overall model was the hybrid model that combines the predictions of Gaussian Process regression using a radial basis function kernel (GP-RBF) with the predictions generated by the Clustering model (Fig. 18; see Table F1 in Appendix F for detailed results)
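To make the GP-RBF component of the hybrid model concrete, the sketch below implements standard Gaussian Process regression with an RBF kernel over arm positions: given rewards observed at a few arms, it returns a posterior mean and uncertainty for every arm. The lengthscale, noise variance, and observations are hypothetical values for illustration, not the fitted parameters reported in the paper.

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0, signal_var=1.0):
    """RBF (squared-exponential) kernel: nearby positions get similar rewards."""
    d = x1[:, None] - x2[None, :]
    return signal_var * np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_posterior(x_obs, y_obs, x_query, lengthscale=1.0, noise_var=0.1):
    """Standard GP regression posterior mean and variance at query positions."""
    K = rbf_kernel(x_obs, x_obs, lengthscale) + noise_var * np.eye(len(x_obs))
    K_s = rbf_kernel(x_query, x_obs, lengthscale)
    K_ss = rbf_kernel(x_query, x_query, lengthscale)
    mean = K_s @ np.linalg.solve(K, y_obs)
    cov = K_ss - K_s @ np.linalg.solve(K, K_s.T)
    return mean, np.diag(cov)

# Hypothetical observations from three arms on a 1-D position axis.
x_obs = np.array([0.0, 3.0, 7.0])
y_obs = np.array([5.1, 11.2, 18.9])
x_query = np.arange(8.0)

mean, var = gp_posterior(x_obs, y_obs, x_query)
print(np.round(mean, 2))  # generalized reward estimates for all arms
print(np.round(var, 2))   # uncertainty is lowest near observed arms
```

The kernel encodes the generalization assumption: options at nearby spatial positions are expected to yield similar rewards, so a few observations constrain estimates for unobserved arms.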

Introduction

Imagine walking into a nearby supermarket to buy groceries for tonight's dinner. If you had enough time, you could try everything in the store multiple times to get a good sense of what you liked, and then repeatedly buy what you liked the most. This level of exhaustive exploration is unrealistic. Instead, you could explore in proportion to your uncertainty about the value of each item (Auer, Cesa-Bianchi, & Fischer, 2002; Gershman, 2018; Schulz & Gershman, 2019), choosing foods you know little about until you have accumulated enough knowledge to reliably select the ones you like. This paper studies how people combine generalization and uncertainty to guide exploration.
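One standard way to formalize "exploring in proportion to your uncertainty", in the spirit of the upper confidence bound (UCB) algorithms cited above, is to score each option by its estimated value plus a bonus proportional to the uncertainty of that estimate. The means, standard deviations, and exploration weight below are illustrative assumptions, not values fitted to participants.

```python
import numpy as np

def ucb_choice(mean, std, beta=1.0):
    """Pick the option maximizing estimated value plus an uncertainty bonus.

    beta controls the exploration-exploitation trade-off: larger beta
    favors poorly known options, while beta = 0 is pure exploitation.
    """
    return int(np.argmax(mean + beta * std))

# Hypothetical value estimates and uncertainties for five food items.
mean = np.array([4.0, 6.5, 5.0, 3.0, 6.0])  # how good each item seems
std = np.array([0.2, 0.3, 2.0, 1.5, 0.1])   # how uncertain each estimate is

print(ucb_choice(mean, std, beta=0.0))  # 1: exploit the current best guess
print(ucb_choice(mean, std, beta=2.0))  # 2: explore the highly uncertain item
```

Plugging the posterior mean and variance from a function-learning model such as GP-RBF into a rule like this yields the kind of uncertainty-guided, structure-aware exploration the paper's model comparison evaluates.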
