Exploration Bonuses Research Articles

The anatomy of choice: active inference and agency

This paper considers agency in the setting of embodied or active inference. In brief, we associate a sense of agency with prior beliefs about action and ask what sorts of beliefs underlie optimal behavior. In particular, we consider prior beliefs that action minimizes the Kullback–Leibler (KL) divergence between desired states and attainable states in the future. This allows one to formulate bounded rationality as approximate Bayesian inference that optimizes a free energy bound on model evidence. We show that constructs like expected utility, exploration bonuses, softmax choice rules and optimism bias emerge as natural consequences of this formulation. Previous accounts of active inference have focused on predictive coding and Bayesian filtering schemes for minimizing free energy. Here, we consider variational Bayes as an alternative scheme that provides formal constraints on the computational anatomy of inference and action—constraints that are remarkably consistent with neuroanatomy. Furthermore, this scheme contextualizes optimal decision theory and economic (utilitarian) formulations as pure inference problems. For example, expected utility theory emerges as a special case of free energy minimization, where the sensitivity or inverse temperature (of softmax functions and quantal response equilibria) has a unique and Bayes-optimal solution—that minimizes free energy. This sensitivity corresponds to the precision of beliefs about behavior, such that attainable goals are afforded a higher precision or confidence. In turn, this means that optimal behavior entails a representation of confidence about outcomes that are under an agent's control.
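The link between KL divergence and the softmax choice rule described above can be made concrete with a small sketch. It assumes that each policy is scored by the negative KL divergence between the outcome distribution it can attain and the desired distribution, and that these scores pass through a softmax scaled by a precision parameter. The distributions, the policy names, and the fixed precision values are all hypothetical; the abstract describes precision as having a Bayes-optimal value, which this sketch does not compute.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same outcomes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def softmax_choice(values, precision):
    """Softmax over values, sharpened by a precision (inverse temperature)."""
    scaled = [precision * v for v in values]
    top = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical desired distribution over three final states.
desired = [0.05, 0.05, 0.90]

# Hypothetical distributions over final states attainable under each policy.
attainable = {
    "policy_a": [0.10, 0.10, 0.80],
    "policy_b": [0.60, 0.30, 0.10],
}

# Assumed scoring rule: a policy is better the closer its attainable
# outcomes are to the desired ones.
values = {name: -kl_divergence(dist, desired) for name, dist in attainable.items()}
names = list(values)

for precision in (0.5, 4.0):
    probs = softmax_choice([values[n] for n in names], precision)
    print(precision, {n: round(p, 3) for n, p in zip(names, probs)})
```

With a low precision the choice probabilities stay close to uniform; with a higher precision they concentrate on the policy whose attainable outcomes lie closest to the desired ones.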

Discounting as task termination, and its implications

Paul Schrater (University of Minnesota, United States) and Constantin Rothkopf (Frankfurt Institute for Advanced Studies, Germany)

Discounting is a natural part of the formulation of most sequential decision problems and has been reported to underlie human and animal behavior in numerous empirical studies. Yet discounting is typically treated as an arbitrary parameter needed to bound the expected future reward, and as a fixed constant for all states. However, rather than an arbitrary parameter, discounting can also be interpreted as the probability of task termination. We show that assigning individual discount rates to separate states provides a framework for exploration bonuses, a rational basis for bounded computation, and a basis for the automatic construction of stochastic options, or macro actions. Our results capitalize on work by Sonin (2008), who shows a generalization of the Gittins index to a Markov chain that allows for state-dependent discount rates. This generalized index shows that the exploratory bonus for an option is the reciprocal of the probability of task termination. Moreover, there is a recursive algorithm to compute the generalized index, which produces an ordering of states in terms of the obtainable reward. Furthermore, states that are traversed on the way to high-reward states can be eliminated, producing abstract states with new corresponding transition dynamics and task-termination probabilities. We show how to use this elimination method to construct stochastic options (sub-policies) that find these abstract states, and that the choice between options can be computed via an index function.
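The reading of discounting as task termination can be checked numerically. The sketch below assumes a convention in which the reward of a state is collected first and the task then continues with a state-dependent probability (the discount) or terminates otherwise. It computes the discounted value by backward sweeps and, separately, the expected undiscounted reward collected before termination by pushing the surviving probability mass forward in time. The three-state chain, its rewards, and its continuation probabilities are hypothetical.

```python
def discounted_values(rewards, continue_prob, transitions, sweeps=2000):
    """Backward sweeps of V(s) = r(s) + g(s) * sum_t P(s, t) * V(t)."""
    n = len(rewards)
    values = [0.0] * n
    for _ in range(sweeps):
        values = [
            rewards[s]
            + continue_prob[s] * sum(transitions[s][t] * values[t] for t in range(n))
            for s in range(n)
        ]
    return values


def expected_return_until_termination(rewards, continue_prob, transitions,
                                      start, horizon=2000):
    """Expected undiscounted reward collected before the task terminates."""
    n = len(rewards)
    alive = [0.0] * n  # probability of being in each state with the task still running
    alive[start] = 1.0
    total = 0.0
    for _ in range(horizon):
        total += sum(alive[s] * rewards[s] for s in range(n))
        nxt = [0.0] * n
        for s in range(n):
            surviving = alive[s] * continue_prob[s]  # mass that does not terminate
            for t in range(n):
                nxt[t] += surviving * transitions[s][t]
        alive = nxt
    return total


# Hypothetical three-state chain with state-dependent continuation probabilities.
rewards = [0.0, 1.0, 5.0]
continue_prob = [0.95, 0.90, 0.50]
transitions = [
    [0.5, 0.5, 0.0],
    [0.1, 0.4, 0.5],
    [0.3, 0.0, 0.7],
]

values = discounted_values(rewards, continue_prob, transitions)
for start in range(3):
    forward = expected_return_until_termination(rewards, continue_prob, transitions, start)
    print(start, round(values[start], 6), round(forward, 6))
```

The two columns agree for every start state. In the special case of a single state with reward 1 and a constant termination probability p, the value is 1/p. That is only the familiar geometric-series identity, not the generalized index discussed in the abstract, but it shows where a reciprocal of the termination probability can come from.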
Using the termination probability formulation, we can derive exploratory incentives for each option and show that the effect of transition uncertainty is to reduce the exploration incentive: in particular, the incentive is reduced by the uncertainty over the set of next states, effectively a branching factor on the look-ahead. This result provides a rational basis for bounding computation in model-based reinforcement learning. We apply the framework to a well-known reinforcement learning problem that is challenging for exploration, the "chain game." Our analysis decomposes the problem into a simple binary choice between two options, given enough experience with the transition probabilities, and we can quantify the difficulty in learning the better option. Humans placed in this environment fell into one of two distinct groups: one group performed enough unrewarded exploratory actions to find the better option, while the second group under-explored and found the worse option. In debriefing, subjects in the former group reported finding the worse option quickly, but believed that higher rewards were possible and thus continued exploration. Conversely, subjects in the latter group typically reported an initial exploratory phase, but upon finding the worse option believed the rest of the chain wasn’t worth exploring. We believe these results provide new insight into the relationship between abstraction, exploration and the overall chance of task termination, and give a computational basis for understanding why exploration should be tied to competence or effectiveness, i.e. the ability to complete the task.

Conference: Computational and Systems Neuroscience 2010, Salt Lake City, UT, United States, 25 Feb - 2 Mar, 2010.
Presentation Type: Poster Presentation
Topic: Poster session III
Citation: Schrater P and Rothkopf C (2010). Discounting as task termination, and its implications. Front. Neurosci. Conference Abstract: Computational and Systems Neuroscience 2010.
doi: 10.3389/conf.fnins.2010.03.00135
Copyright: The abstracts in this collection have not been subject to any Frontiers peer review or checks, and are not endorsed by Frontiers. They are made available through the Frontiers publishing platform as a service to conference organizers and presenters. The copyright in the individual abstracts is owned by the author of each abstract or his/her employer unless otherwise stated. Each abstract, as well as the collection of abstracts, is published under a Creative Commons CC-BY 4.0 (attribution) licence (https://creativecommons.org/licenses/by/4.0/) and may thus be reproduced, translated, adapted and be the subject of derivative works provided the authors and Frontiers are attributed. For Frontiers’ terms and conditions please see https://www.frontiersin.org/legal/terms-and-conditions.
Received: 01 Mar 2010; Published Online: 01 Mar 2010.
Correspondence: Paul Schrater, University of Minnesota, Minneapolis, United States, schrater@cs.umn.edu
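The binary choice in the "chain game" mentioned in the abstract can be illustrated with a small calculation. The abstract does not give the game's parameters, so the sketch assumes a version commonly used as a reinforcement learning benchmark: five states in a row, an "advance" action that pays nothing until it is taken in the last state (reward 10), a "return" action that goes back to the first state (reward 2), and a 0.2 chance that the opposite action is executed. All of these numbers, and the function name, are assumptions. The code computes the long-run reward per step of always choosing one option, using the stationary distribution over states.

```python
def average_reward(prefer_advance, n_states=5, slip=0.2,
                   small=2.0, large=10.0, sweeps=200):
    """Long-run reward per step when one option is always chosen.

    All default parameters are assumptions; the source does not specify them.
    """
    # Probability that the executed action is "advance" after slipping.
    p_advance = 1.0 - slip if prefer_advance else slip

    # Power iteration for the stationary distribution over chain positions.
    dist = [1.0] + [0.0] * (n_states - 1)
    for _ in range(sweeps):
        nxt = [0.0] * n_states
        for s, mass in enumerate(dist):
            nxt[min(s + 1, n_states - 1)] += mass * p_advance  # move along the chain
            nxt[0] += mass * (1.0 - p_advance)                 # return to the start
        dist = nxt

    # Returning pays the small reward from any state; advancing pays the
    # large reward only when executed in the last state.
    return (1.0 - p_advance) * small + dist[-1] * p_advance * large


print("always advance:", round(average_reward(True), 4))
print("always return: ", round(average_reward(False), 4))
```

Under these assumed parameters, always advancing earns roughly 3.68 per step and always returning roughly 1.60. The better option requires several unrewarded steps before any payoff appears, which matches the exploration difficulty the abstract describes.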

Articles published on Exploration Bonuses

Analysis of Performance Measure in Q Learning with UCB Exploration

A Policy for Optimizing Sub-Band Selection Sequences in Wideband Spectrum Sensing

The anatomy of choice: active inference and agency

Discounting as task termination, and its implications

Dopamine: generalization and bonuses

Exploration bonuses and dual control

