Abstract

Humans and animals face decision tasks in an uncertain multi-agent environment where an agent's strategy may change over time due to the co-adaptation of the other agents' strategies. The neuronal substrate and the computational algorithms underlying such adaptive decision making, however, are largely unknown. We propose a population coding model of spiking neurons with a policy gradient procedure that successfully acquires optimal strategies for classical game-theoretical tasks. The suggested population reinforcement learning reproduces data from human behavioral experiments for the blackjack and the inspector game. It performs optimally according to a pure (deterministic) and a mixed (stochastic) Nash equilibrium, respectively. In contrast, temporal-difference (TD) learning, covariance learning, and basic reinforcement learning fail to perform optimally for the stochastic strategy. Spike-based population reinforcement learning, shown to follow the stochastic reward gradient, is therefore a viable candidate to explain automated decision learning of a Nash equilibrium in two-player games.
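
To make the learning of a mixed (stochastic) strategy concrete, the sketch below lets two REINFORCE-style policy-gradient learners co-adapt in a 2x2 zero-sum game. It is a deliberately simplified, rate-based stand-in for the paper's spike-based population model: the matching-pennies payoffs, learning rate, and reward baselines are illustrative assumptions, not values from the study.

    # A minimal, hypothetical sketch -- not the paper's spiking population model.
    # Two REINFORCE-style policy-gradient learners co-adapt in a 2x2 zero-sum
    # game; the matching-pennies payoffs, learning rate and baselines below are
    # illustrative assumptions, not values from the study.
    import numpy as np

    rng = np.random.default_rng(0)

    payoff_row = np.array([[+1, -1],
                           [-1, +1]])   # row player wants to match
    payoff_col = -payoff_row            # column player wants to mismatch

    theta_row = theta_col = 0.0         # logits of playing action 0
    baseline_row = baseline_col = 0.0   # running reward baselines (variance reduction)
    lr = 0.01

    def p_action0(theta):
        return 1.0 / (1.0 + np.exp(-theta))

    history = []
    for t in range(50_000):
        p, q = p_action0(theta_row), p_action0(theta_col)
        a = 0 if rng.random() < p else 1
        b = 0 if rng.random() < q else 1
        r_row, r_col = payoff_row[a, b], payoff_col[a, b]

        # REINFORCE: theta += lr * (R - baseline) * d log pi(action) / d theta
        grad_row = (1.0 - p) if a == 0 else -p
        grad_col = (1.0 - q) if b == 0 else -q
        theta_row += lr * (r_row - baseline_row) * grad_row
        theta_col += lr * (r_col - baseline_col) * grad_col
        baseline_row += 0.01 * (r_row - baseline_row)
        baseline_col += 0.01 * (r_col - baseline_col)
        history.append(p)

    # With a small learning rate the two mixed strategies orbit the equilibrium;
    # their running average settles near the mixed Nash equilibrium p = q = 0.5.
    print("average P(row plays action 0):", np.mean(history[-10_000:]))

A greedy, value-based learner in the same setting keeps switching to whichever pure action currently looks best and can be exploited by the co-adapting opponent, which gives an intuition for why the abstract singles out policy-gradient learning for the mixed (stochastic) equilibrium.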

Highlights

  • Neuroeconomics is an interdisciplinary research field that tries to explain human decision making in neuronal terms

  • Multi-agent games are not Markovian, as the evolution of the environment typically depends not only on the current state but also on the history and on the adaptation of the other agents. Such games can be described as partially observable Markov decision processes (POMDPs, [6]) by embedding the sequences and the learning strategies of the other agents into a large state space

  • We have presented a policy gradient method for population reinforcement learning which, unlike temporal-difference (TD) learning, can cope with POMDPs and can be implemented in neuronal terms [7]


Introduction

Neuroeconomics is an interdisciplinary research field that tries to explain human decision making in neuronal terms. Classical models in neuroeconomics are based on temporal-difference (TD) learning [1], an algorithm to maximize the total expected reward [2] with potential neuronal implementations [3,4]. It assumes that the environment can be described as a Markov decision process (MDP), i.e. by a finite number of states with fixed transition probabilities [5]. Multi-agent games are not Markovian, as the evolution of the environment typically depends not only on the current state but also on the history and on the adaptation of the other agents. Such games can be described as partially observable Markov decision processes (POMDPs, [6]) by embedding the sequences and the learning strategies of the other agents into a large state space. Maximizing one's own payoff while assuming stationarity of the opponent's strategy is called fictitious play, and conditions have been studied under which this play effectively converges to a stationary (Nash) equilibrium [8].
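
As a concrete illustration of the fictitious-play idea in the last sentence, the following sketch (a hypothetical toy, not code from the paper) lets two players repeatedly best-respond to the empirical frequency of each other's past actions in a 2x2 zero-sum game with illustrative matching-pennies payoffs. For two-player zero-sum games these empirical frequencies are known to converge to a Nash equilibrium, here the mixed equilibrium at probability 1/2 for each action, even though the actual sequence of plays keeps cycling.

    # A hypothetical fictitious-play toy, not code from the paper: both players
    # best-respond to the empirical frequency of the opponent's past actions.
    # The matching-pennies payoffs are illustrative.
    import numpy as np

    payoff_row = np.array([[+1, -1],
                           [-1, +1]])
    payoff_col = -payoff_row

    row_action_counts = np.ones(2)   # column player's record of the row player's actions
    col_action_counts = np.ones(2)   # row player's record of the column player's actions

    for t in range(100_000):
        belief_about_col = col_action_counts / col_action_counts.sum()
        belief_about_row = row_action_counts / row_action_counts.sum()
        a = int(np.argmax(payoff_row @ belief_about_col))   # row best response
        b = int(np.argmax(belief_about_row @ payoff_col))   # column best response
        row_action_counts[a] += 1
        col_action_counts[b] += 1

    # For two-player zero-sum games the empirical frequencies converge to a
    # Nash equilibrium; here both approach the mixed equilibrium (1/2, 1/2).
    print("empirical row strategy:", row_action_counts / row_action_counts.sum())
    print("empirical col strategy:", col_action_counts / col_action_counts.sum())

The stationarity assumption behind this scheme is exactly what breaks down when the opponent is itself learning, which is why the text above reformulates such games as POMDPs.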


