Abstract

In natural environments with many options of uncertain value, one faces a difficult tradeoff between exploiting familiar, valuable options and searching for better alternatives. Reinforcement learning models of this exploration/exploitation dilemma typically modulate the rate of exploratory choices or preferentially sample uncertain options. The extent to which such models capture human behavior remains unclear, in part because they do not consider the constraints on remembering what is learned. Using reinforcement-based timing as a motivating example, we show that selectively maintaining high-value actions compresses the amount of information to be tracked in learning, as quantified by Shannon's entropy. In turn, the information content of the value representation controls the balance between exploration (high entropy) and exploitation (low entropy). Selectively maintaining preferred action values while allowing others to decay renders choices increasingly exploitative across learning episodes. To adjudicate among alternative maintenance and sampling strategies, we developed a new reinforcement learning model, StrategiC ExPloration/ExPloitation of Temporal Instrumental Contingencies (SCEPTIC). In computational studies, a resource-rational selective maintenance approach was as successful as more resource-intensive strategies. Furthermore, human behavior was consistent with selective maintenance; information compression was most pronounced in subjects with superior performance and non-verbal intelligence, and in learnable vs. unlearnable contingencies. Cognitively demanding uncertainty-directed exploration recovered a more accurate representation in simulations but conferred no foraging advantage, and it was strongly unsupported in our human study.
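To make the selective-maintenance idea concrete, the following minimal Python sketch (illustrative only: the delta-rule learning rate, decay rate, softmax temperature, and toy contingency are assumptions, not the fitted SCEPTIC model) updates the value of the sampled action, lets unsampled values decay toward zero, and tracks the Shannon entropy of the normalized value vector. As value concentrates on a few actions, entropy falls and softmax choice becomes increasingly exploitative.

    import numpy as np

    def shannon_entropy(values, eps=1e-12):
        # Entropy (nats) of the value vector treated as a probability distribution.
        p = np.maximum(values, 0.0) + eps
        p = p / p.sum()
        return -np.sum(p * np.log(p))

    def selective_maintenance_update(values, chosen, reward, alpha=0.1, decay=0.05):
        # Delta-rule update of the chosen action; all other values decay toward zero.
        values = values.copy()
        values[chosen] += alpha * (reward - values[chosen])
        unchosen = np.arange(len(values)) != chosen
        values[unchosen] *= (1.0 - decay)
        return values

    def softmax_choice(values, beta=5.0, rng=np.random):
        # A peaked (low-entropy) value vector yields mostly exploitative choices.
        logits = beta * (values - values.max())
        p = np.exp(logits) / np.exp(logits).sum()
        return rng.choice(len(values), p=p)

    # Toy run: 40 discrete actions (e.g., response times); reward probability peaks at action 25.
    rng = np.random.default_rng(0)
    values = np.zeros(40)
    for trial in range(200):
        a = softmax_choice(values, rng=rng)
        reward = float(rng.random() < np.exp(-((a - 25) ** 2) / 50.0))
        values = selective_maintenance_update(values, a, reward)
        if trial % 50 == 0:
            print(trial, round(shannon_entropy(values), 3))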

Highlights

  • To rule out the possibility that the selective maintenance model fit well merely because it better represented sticky choices, we extended SCEPTIC models with two choice autocorrelation functions (ACF): a simple first-order autoregressive (AR[1]) ACF and an ACF extended over multiple trials (Schönberg, Daw, Joel, & O'Doherty, 2007).

  • SCEPTIC: the impact of uncertainty on exploration. To embody the alternative hypothesis that exploration is modulated by uncertainty, we developed SCEPTIC variants in which choice was influenced by both uncertainty (U), estimated by Bayesian filtering, and reward value (V); see the sketch below.
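As a contrast to selective maintenance, the sketch below illustrates one common form of uncertainty-directed valuation: a per-action Kalman filter tracks a posterior mean (V) and variance (U), and choice is a softmax over V plus an uncertainty bonus. The filter form, bonus weight tau, and the other parameters are assumptions for illustration, not the paper's Bayesian-filtering specification.

    import numpy as np

    def kalman_update(mu, sigma2, chosen, reward, obs_noise=1.0):
        # Posterior mean (value, V) and variance (uncertainty, U) for the sampled action.
        mu, sigma2 = mu.copy(), sigma2.copy()
        gain = sigma2[chosen] / (sigma2[chosen] + obs_noise)
        mu[chosen] += gain * (reward - mu[chosen])
        sigma2[chosen] *= (1.0 - gain)  # sampling an action reduces its uncertainty
        return mu, sigma2

    def uncertainty_directed_choice(mu, sigma2, beta=4.0, tau=0.5, rng=np.random):
        # Softmax over V + tau * sqrt(U): positive tau gives a bonus to uncertain actions.
        score = mu + tau * np.sqrt(sigma2)
        logits = beta * (score - score.max())
        p = np.exp(logits) / np.exp(logits).sum()
        return rng.choice(len(mu), p=p)

    # Usage: 40 actions with diffuse priors; uncertain actions are preferentially sampled early on.
    mu, sigma2 = np.zeros(40), np.full(40, 10.0)
    rng = np.random.default_rng(1)
    a = uncertainty_directed_choice(mu, sigma2, rng=rng)
    mu, sigma2 = kalman_update(mu, sigma2, a, reward=1.0)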

Introduction

"It is better to understand a little than to misunderstand a lot." - Anatole France

Laboratory studies of value-based decision-making typically involve choosing among a few actions according to their perceived subjective value (Glimcher & Fehr, 2013). How can an organism with limited computational resources learn the most advantageous actions in the natural environment? Previous work on boundedly rational agents has considered the role of a limited-capacity working memory system (Collins & Frank, 2012) and the possibility that metareasoning (i.e., a policy guiding how to allocate resources) reduces the complexity of learning in large action spaces (Russell & Wefald, 1991). This study provides a new, complementary account highlighting how the selective maintenance of value information facilitates the search for the best among many actions. One of the fundamental dilemmas in reinforcement learning is how to choose between exploiting an action with a known positive value and exploring alternatives in search of even more advantageous actions (Sutton & Barto, 1998).
