Abstract

Animals repeat rewarded behaviors, but the physiological basis of reward-based learning has only been partially elucidated. On one hand, experimental evidence shows that the neuromodulator dopamine carries information about rewards and affects synaptic plasticity. On the other hand, the theory of reinforcement learning provides a framework for reward-based learning. Recent models of reward-modulated spike-timing-dependent plasticity have taken first steps towards bridging the gap between the two approaches, but face two problems. First, reinforcement learning is typically formulated in a discrete framework, ill-adapted to the description of natural situations. Second, biologically plausible models of reward-modulated spike-timing-dependent plasticity require precise calculation of the reward prediction error, yet it remains to be shown how this can be computed by neurons. Here we propose a solution to these problems by extending the continuous temporal difference (TD) learning of Doya (2000) to the case of spiking neurons in an actor-critic network operating in continuous time, and with continuous state and action representations. In our model, the critic learns to predict expected future rewards in real time. Its activity, together with actual rewards, conditions the delivery of a neuromodulatory TD signal to itself and to the actor, which is responsible for action choice. In simulations, we show that such an architecture can solve a Morris water-maze-like navigation task, in a number of trials consistent with reported animal performance. We also use our model to solve the acrobot and the cartpole problems, two complex motor control tasks. Our model provides a plausible way of computing reward prediction error in the brain. Moreover, the analytically derived learning rule is consistent with experimental evidence for dopamine-modulated spike-timing-dependent plasticity.
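
To make the critic's computation concrete, the following is a minimal sketch of continuous-time TD learning in the spirit of Doya (2000), with the value function approximated as a linear readout of state features. All names and parameter values (tau_r, alpha, dt, phi) are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

# Continuous-time TD error (Doya, 2000):
#   delta(t) = r(t) + dV/dt - V(t)/tau_r
# where tau_r sets the reward-discounting horizon. The derivative
# is approximated by a backward difference over one time step.

tau_r = 1.0    # discount time constant (s); illustrative value
alpha = 0.05   # critic learning rate; illustrative value
dt = 0.01      # simulation time step (s)

def td_error(r, v, v_prev):
    dv_dt = (v - v_prev) / dt
    return r + dv_dt - v / tau_r

def critic_update(w, delta, phi_prev):
    # Semi-gradient TD update for a linear critic V(s) = w @ phi(s):
    # nudge the weights so the previous state's value estimate
    # better matches reward plus discounted future value.
    return w + alpha * delta * phi_prev
```

In the paper's spiking implementation, this same error plays the role of a neuromodulatory signal broadcast to both the critic's and the actor's synapses.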

Highlights

  • Many instances of animal behavior learning, such as path finding in foraging or – a more artificial example – navigating the Morris water-maze, can be interpreted as exploration and trial-and-error learning

  • We propose a model explaining how reward signals might interplay with synaptic plasticity, and use the model to solve a simulated maze navigation task

  • Our model extends an idea from the theory of reinforcement learning: one group of neurons forms an "actor," responsible for choosing the direction of motion of the animal, while a second group, the "critic," learns to predict future rewards (a toy sketch of the actor's readout follows this list)
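
As a toy illustration of the actor highlighted above, the sketch below reads out a direction of motion from a population of actor neurons via a population-vector average. The neuron count, tuning, and readout are hypothetical choices for illustration, not the paper's exact architecture.

```python
import numpy as np

# Hypothetical actor readout: each actor neuron is assigned a
# preferred direction, and the chosen direction of motion is the
# population-vector average of the neurons' firing rates.

N = 60
preferred = np.linspace(0.0, 2.0 * np.pi, N, endpoint=False)

def choose_direction(rates):
    """Population-vector readout: sum unit vectors along each
    neuron's preferred direction, weighted by its firing rate."""
    x = np.sum(rates * np.cos(preferred))
    y = np.sum(rates * np.sin(preferred))
    return np.arctan2(y, x)  # direction of motion, in radians

# Example: a bump of activity centered on pi/3 yields a direction
# near pi/3 (about 1.047 rad).
rates = np.exp(2.0 * np.cos(preferred - np.pi / 3))
print(choose_direction(rates))
```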



Introduction

Many instances of animal behavior learning, such as path finding in foraging or – a more artificial example – navigating the Morris water-maze, can be interpreted as exploration and trial-and-error learning. In both examples, the behavior eventually learned by the animal is the one that led to high reward. Several algorithms have been developed to solve this standard formulation of the problem, and some of these have been used with spiking neural systems. These include REINFORCE [3,4] and, for the case where the agent has incomplete knowledge of its state, partially observable Markov decision processes [5,6].
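
For orientation, here is a minimal sketch of REINFORCE on a one-step task with a softmax policy. It illustrates the policy-gradient family cited above rather than the actor-critic model developed in this paper; the task, reward scheme, and parameters are invented for the example.

```python
import numpy as np

# REINFORCE (policy gradient, no baseline) on a 4-armed bandit.
rng = np.random.default_rng(0)
theta = np.zeros(4)   # one preference per action
alpha = 0.1           # learning rate; illustrative value

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reward(a):
    return 1.0 if a == 2 else 0.0   # hypothetical reward scheme

for _ in range(2000):
    p = softmax(theta)
    a = rng.choice(len(theta), p=p)
    r = reward(a)
    # For a softmax policy, grad log pi(a) = one_hot(a) - p.
    grad_log_pi = -p
    grad_log_pi[a] += 1.0
    theta += alpha * r * grad_log_pi   # reinforce rewarded actions

print(softmax(theta))  # probability mass concentrates on action 2
```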

