Abstract

Temporal-difference (TD) algorithms have been proposed as models of reinforcement learning (RL). We examine two issues of distributed representation in these TD algorithms: distributed representations of belief and distributed discounting factors. Distributed representation of belief allows the believed state of the world to distribute across sets of equivalent states. Distributed exponential discounting factors produce hyperbolic discounting in the behavior of the agent itself. We examine these issues in the context of a TD RL model in which state-belief is distributed over a set of exponentially-discounting “micro-Agents” (µAgents), each of which has a separate discounting factor (γ). Each µAgent maintains an independent hypothesis about the state of the world, and a separate value-estimate of taking actions within that hypothesized state. The overall agent thus instantiates a flexible representation of an evolving world-state. As with other TD models, the value-error (δ) signal within the model matches dopamine signals recorded from animals in standard conditioning reward paradigms. The distributed representation of belief provides an explanation for the decrease in dopamine at the conditioned stimulus seen in overtrained animals, for the differences between trace and delay conditioning, and for transient bursts of dopamine seen at movement initiation. Because each µAgent also includes its own exponential discounting factor, the overall agent shows hyperbolic discounting, consistent with behavioral experiments.
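To make the last point concrete, a population of exponential discounters with heterogeneous γ behaves, in aggregate, like a hyperbolic discounter. The sketch below is not the authors' implementation; the uniform γ distribution and the agent count are illustrative assumptions. It simply demonstrates numerically that the average of γ^t over γ drawn uniformly from (0, 1) approaches 1/(1 + t), a hyperbolic discount curve:

```python
import numpy as np

# Minimal numerical sketch (not the paper's implementation): each "micro-agent"
# discounts exponentially with its own gamma, but the *average* discount curve
# of the population is hyperbolic. With gammas drawn uniformly from (0, 1),
# the population mean of gamma**t tends to 1 / (1 + t) as the number of agents
# grows. Agent count and gamma distribution are illustrative assumptions.
rng = np.random.default_rng(0)
gammas = rng.uniform(0.0, 1.0, size=100_000)    # one exponential gamma per micro-agent

for t in [0, 1, 2, 5, 10, 25]:
    population_discount = np.mean(gammas ** t)  # average of the exponential curves at delay t
    hyperbolic = 1.0 / (1.0 + t)                # hyperbolic discount at delay t
    print(f"t={t:2d}  mean(gamma**t)={population_discount:.4f}  1/(1+t)={hyperbolic:.4f}")
```

The choice of a uniform γ distribution is only for analytic convenience; other broad distributions of γ also produce aggregate curves that are better fit by hyperbolic than by single-exponential functions.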

Highlights

  • Temporal-difference (TD) learning algorithms have been proposed to model behavioral reinforcement learning (RL) [1,2,3]

  • Levy’s theory predicted that it should take some time for that set of intervening states to develop [62]; before the system has settled on a set of intervening states, µAgents would distribute themselves among the large set of potential states, producing an equivalent-set-like effect

  • If explicit, salient markers were provided for the ITI (inter-trial interval) state, animals should show a faster transfer of δ across the ITI gap, and a faster decrease in the δ signal at the conditioned stimulus (CS)


Introduction

Temporal-difference (TD) learning algorithms have been proposed to model behavioral reinforcement learning (RL) [1,2,3]. On each state transition, the observed value (the reward received plus the discounted value-estimate of the new state) is compared with the value-estimate of the old state; the difference is a value-prediction error, δ, and the value-estimate of the old state can be updated toward the observed value. This δ signal appears at unexpected rewards, transfers with learning from rewards to anticipatory cue stimuli, and shifts with changes in anticipated reward [4,8]. This algorithm is a generalization of early psychological reward-error models [9,10]. Components of these models have been proposed to correspond to neurophysiological signals [1,2,8,11,12,13,14]. The firing of midbrain dopaminergic neurons closely matches δ.
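For concreteness, the following is a minimal tabular TD(0) sketch of the δ computation (standard textbook TD learning, not the distributed-belief µAgent model of this paper; the cue/reward task, state names, and parameters are illustrative assumptions). With training, δ at the reward shrinks toward zero while δ at the unexpected cue grows, matching the "transfer" described above:

```python
# Minimal tabular TD(0) sketch of the value-prediction-error (delta) signal.
# This is standard textbook TD learning, not this paper's model; the task,
# state names, and parameters are illustrative assumptions: an unexpected
# cue (CS) is followed one step later by a reward of 1.
alpha, gamma = 0.1, 0.95          # learning rate, exponential discount factor
V = {"CS": 0.0, "R": 0.0}         # value-estimates for the cue and reward states

def td_delta(value_old, value_new, reward):
    """delta = observed value (r + gamma * V(new)) minus predicted value V(old)."""
    return reward + gamma * value_new - value_old

for trial in range(300):
    # Cue onset is unpredictable, so the pre-cue (background) value is taken as 0.
    delta_at_cue = td_delta(0.0, V["CS"], reward=0.0)

    # CS -> R transition: no reward yet; update V(CS) toward gamma * V(R).
    delta_cs_to_r = td_delta(V["CS"], V["R"], reward=0.0)
    V["CS"] += alpha * delta_cs_to_r

    # Reward delivery at R, then the trial ends (terminal value 0).
    delta_at_reward = td_delta(V["R"], 0.0, reward=1.0)
    V["R"] += alpha * delta_at_reward

# With training, delta at the reward shrinks toward 0 while delta at the
# unexpected cue grows toward gamma**2: the error has transferred to the cue.
print(f"delta at cue = {delta_at_cue:.3f}, delta at reward = {delta_at_reward:.3f}")
```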

