Signatures of reinforcement learning in natural behavior.
Across myriad real-world contexts, we encounter the challenge of learning to take actions that bring about desirable outcomes. The theoretical framework of reinforcement learning proposes formal algorithms through which agents learn from experience to make rewarding choices. These formal models capture many aspects of reward-guided human behavior in controlled laboratory contexts. Here, we suggest that the constructs (i.e., states, actions, and rewards) and algorithms formalized within reinforcement learning theory can be operationally defined and extended to additionally account for learning in complex, natural environments. We discuss several recent empirical studies that provide evidence of signatures of reinforcement learning across diverse human behaviors in everyday environments.
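To make these constructs concrete, here is a minimal tabular Q-learning sketch in which states, actions, and rewards appear explicitly; all names and parameter values are illustrative assumptions, not taken from the article.

```python
import random
from collections import defaultdict

# Minimal tabular Q-learning sketch. ALPHA (learning rate), GAMMA (discount),
# and EPSILON (exploration rate) are assumed illustrative values.
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1
ACTIONS = ["left", "right"]

Q = defaultdict(float)  # Q[(state, action)] -> estimated action value

def choose_action(state):
    """Epsilon-greedy: mostly exploit the best-known action, sometimes explore."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state):
    """One-step Q-learning: move Q(s, a) toward the bootstrapped target."""
    target = reward + GAMMA * max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (target - Q[(state, action)])
```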
- Research Article
34
- 10.1371/journal.pcbi.1009070
- Jun 3, 2021
- PLOS Computational Biology
Classic reinforcement learning (RL) theories cannot explain human behavior in the absence of external reward or when the environment changes. Here, we employ a deep sequential decision-making paradigm with sparse reward and abrupt environmental changes. To explain the behavior of human participants in these environments, we show that RL theories need to include surprise and novelty, each with a distinct role. While novelty drives exploration before the first encounter of a reward, surprise increases the rate of learning of a world-model as well as of model-free action-values. Even though the world-model is available for model-based RL, we find that human decisions are dominated by model-free action choices. The world-model is only marginally used for planning, but it is important to detect surprising events. Our theory predicts human action choices with high probability and allows us to dissociate surprise, novelty, and reward in EEG signals.
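The distinct roles the abstract assigns to novelty (driving exploration before the first reward) and surprise (raising the learning rate) can be sketched roughly as follows; the sigmoid gain and count-based novelty bonus are assumed functional forms, not the authors' model.

```python
import math

GAMMA = 0.95  # assumed discount factor

def surprise_gain(surprise, base_alpha=0.1, slope=5.0):
    """Assumed sigmoid: higher surprise -> higher effective learning rate."""
    return base_alpha + (1 - base_alpha) / (1 + math.exp(-slope * (surprise - 1.0)))

def novelty_bonus(visit_count, scale=1.0):
    """Assumed count-based novelty: rarely visited states yield larger bonuses."""
    return scale / math.sqrt(visit_count + 1)

def q_update(Q, s, a, r, s_next, actions, surprise, visit_count):
    """Model-free update: the step size grows with surprise, and the target
    carries a novelty bonus that drives exploration before reward is found."""
    alpha = surprise_gain(surprise)
    q_sa = Q.get((s, a), 0.0)
    target = r + novelty_bonus(visit_count) + GAMMA * max(
        Q.get((s_next, b), 0.0) for b in actions)
    Q[(s, a)] = q_sa + alpha * (target - q_sa)
```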
- Book Chapter
- 10.1017/9781108755610.013
- May 11, 2023
Reinforcement learning (RL) is a computational framework for an active agent to learn behaviors on the basis of scalar reward feedback. The theory of reinforcement learning was developed in the artificial intelligence community, with intuitions from psychology and animal learning theory and a mathematical basis in control theory. It has been successfully applied to tasks such as game playing and robot control. Reinforcement learning gives a theoretical account of behavioral learning in humans and animals and of the underlying brain mechanisms, such as dopamine signaling and the basal ganglia circuit. It serves as the "common language" for engineers, biologists, and cognitive scientists to exchange their problems and findings about goal-directed behaviors. This chapter introduces the basic theoretical framework of reinforcement learning and reviews its impacts in artificial intelligence, neuroscience, and cognitive science.
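The agent-environment loop this framework formalizes can be sketched as below; the toy chain environment and the random agent are stand-ins invented for illustration.

```python
import random

class ChainEnv:
    """Toy 4-state chain: reward 1 for reaching the rightmost state."""
    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):  # action: +1 (right) or -1 (left)
        self.pos = max(0, self.pos + action)
        reward = 1.0 if self.pos == 3 else 0.0
        return self.pos, reward, self.pos == 3

class RandomAgent:
    def act(self, state):
        return random.choice([-1, 1])

    def learn(self, s, a, r, s_next):
        pass  # a learning agent would update its values here

def run_episode(env, agent, max_steps=100):
    """The core RL abstraction: observe state, act, receive scalar reward."""
    state, total = env.reset(), 0.0
    for _ in range(max_steps):
        action = agent.act(state)
        next_state, reward, done = env.step(action)
        agent.learn(state, action, reward, next_state)
        total += reward
        state = next_state
        if done:
            break
    return total

print(run_episode(ChainEnv(), RandomAgent()))
```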
- Research Article
6
- 10.1016/j.neucom.2012.04.026
- Sep 29, 2012
- Neurocomputing
Modeling error detection in human brain: A preliminary unification of reinforcement learning and conflict monitoring theories
- Conference Article
- 10.1109/cieec50170.2021.9510406
- May 28, 2021
Multi-agent systems (MAS) based on reinforcement learning (RL) theory are an important means of designing and evaluating electricity market rules, since they can better simulate the operation of electricity markets and the activities of market participants. RL is a learning algorithm that maps uncertain information to behaviors through the intelligent system's perception of its environment. A MAS is a complex system that transforms an economic model into a system of interacting agents. Research on reinforcement learning based multi-agent systems (RLMAS) and their applications has received growing attention in the field of electricity market simulation. This paper first introduces the basic ideas and algorithms of RLMAS, then summarizes the main applications of RLMAS theory in market operation, strategy choices of market agents, electricity price forecasting, and energy internet systems. Finally, prospects for its application in electricity markets are discussed.
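A toy flavor of the RLMAS idea, with independent Q-learning bidders in a pay-as-bid market, is sketched below; the clearing rule, price grid, and agent design are invented for illustration and do not come from the survey.

```python
import random

PRICES = [10, 20, 30]  # assumed discrete bid prices per generator agent

class BidderAgent:
    """Stateless Q-learner over bid prices (a bandit-style market agent)."""
    def __init__(self, alpha=0.1, epsilon=0.1):
        self.q = {p: 0.0 for p in PRICES}
        self.alpha, self.epsilon = alpha, epsilon

    def bid(self):
        if random.random() < self.epsilon:
            return random.choice(PRICES)
        return max(self.q, key=self.q.get)

    def learn(self, price, profit):
        self.q[price] += self.alpha * (profit - self.q[price])

def clear_market(bids, demand=2):
    """Toy pay-as-bid clearing: the `demand` cheapest bids are accepted."""
    accepted = sorted(range(len(bids)), key=lambda i: bids[i])[:demand]
    return {i: bids[i] for i in accepted}  # agent index -> profit (its bid)

agents = [BidderAgent() for _ in range(4)]
for _ in range(5000):
    bids = [a.bid() for a in agents]
    profits = clear_market(bids)
    for i, a in enumerate(agents):
        a.learn(bids[i], profits.get(i, 0.0))

print([max(a.q, key=a.q.get) for a in agents])  # learned bidding strategies
```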
- Research Article
7
- 10.3390/systems11020083
- Feb 6, 2023
- Systems
To more effectively solve the complex optimization problems that arise in nonlinear, high-dimensional, large-sample and complex systems, many intelligent optimization methods have been proposed. Among these algorithms, the particle swarm optimization (PSO) algorithm has attracted scholars' attention. However, traditional PSO can easily become trapped at an individual optimal solution, causing the optimization process to shift prematurely from global exploration to local exploitation. To solve this problem, in this paper we propose a Hybrid Reinforcement Learning Particle Swarm Algorithm (HRLPSO) based on the theory of reinforcement learning in psychology. First, we use a reinforcement learning strategy to optimize the initial population in the population initialization stage; then, chaotic adaptive weights and adaptive learning factors are used to balance global exploration against local exploitation, and the individual optimal solution and the global optimal solution are obtained using dimension learning. Finally, the improved reinforcement learning strategy and a mutation strategy are applied to traditional PSO to improve the quality of the individual and global optimal solutions. The HRLPSO algorithm was tested on 12 benchmark functions as well as the CEC2013 test suite, and the results show that it balances individual learning ability and social learning ability, verifying its effectiveness.
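One ingredient of such hybrid designs, a chaotic adaptive inertia weight driven by the logistic map, can be illustrated with the generic PSO sketch below; it does not reproduce the actual HRLPSO update rules, and all parameter values are assumptions.

```python
import random

def sphere(x):
    """Benchmark objective to minimize."""
    return sum(v * v for v in x)

DIM, SWARM, ITERS = 5, 20, 200
C1 = C2 = 2.0  # cognitive and social learning factors

pos = [[random.uniform(-5, 5) for _ in range(DIM)] for _ in range(SWARM)]
vel = [[0.0] * DIM for _ in range(SWARM)]
pbest = [p[:] for p in pos]
gbest = min(pbest, key=sphere)[:]
z = 0.7  # seed of the chaotic logistic map

for _ in range(ITERS):
    z = 4.0 * z * (1.0 - z)  # logistic map: chaotic sequence in (0, 1)
    w = 0.4 + 0.5 * z        # chaotic adaptive inertia weight (assumed range)
    for i in range(SWARM):
        for d in range(DIM):
            r1, r2 = random.random(), random.random()
            vel[i][d] = (w * vel[i][d]
                         + C1 * r1 * (pbest[i][d] - pos[i][d])
                         + C2 * r2 * (gbest[d] - pos[i][d]))
            pos[i][d] += vel[i][d]
        if sphere(pos[i]) < sphere(pbest[i]):
            pbest[i] = pos[i][:]
    gbest = min(pbest, key=sphere)[:]

print("best value found:", sphere(gbest))
```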
- Research Article
118
- 10.2976/1.2732246
- May 1, 2007
- HFSP Journal
Reinforcement learning is a computational framework for an active agent to learn behaviors on the basis of a scalar reward signal. The agent can be an animal, a human, or an artificial system such as a robot or a computer program. The reward can be food, water, money, or whatever measures the performance of the agent. The theory of reinforcement learning, which was developed in the artificial intelligence community with intuitions from animal learning theory, is now giving a coherent account of the function of the basal ganglia. It now serves as the "common language" in which biologists, engineers, and social scientists can exchange their problems and findings. This article reviews the basic theoretical framework of reinforcement learning and discusses its recent and future contributions toward the understanding of animal behaviors and human decision making.
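The reward prediction error at the heart of this account, whose biological analogue is phasic dopamine, reduces to a few lines in the discrete TD(0) case; the parameter values are illustrative.

```python
ALPHA, GAMMA = 0.1, 0.9  # assumed learning rate and discount factor

def td0_update(V, s, r, s_next):
    """TD(0): delta = r + gamma * V(s') - V(s).
    A positive delta means the outcome was better than predicted."""
    delta = r + GAMMA * V.get(s_next, 0.0) - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + ALPHA * delta
    return delta
```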
- Research Article
11
- 10.1371/journal.pcbi.1006518
- Oct 25, 2018
- PLOS Computational Biology
Although a standard reinforcement learning model can capture many aspects of reward-seeking behaviors, it may not be practical for modeling human natural behaviors because of the richness of dynamic environments and limitations in cognitive resources. We propose a modular reinforcement learning model that addresses these factors. Based on this model, a modular inverse reinforcement learning algorithm is developed to estimate both the rewards and discount factors from human behavioral data, which allows predictions of human navigation behaviors in virtual reality with high accuracy across different subjects and with different tasks. Complex human navigation trajectories in novel environments can be reproduced by an artificial agent that is based on the modular model. This model provides a strategy for estimating the subjective value of actions and how they influence sensory-motor decisions in natural behavior.
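A rough sketch of the modular idea, with each module keeping its own action values under its own reward signal and discount factor, follows; the module names, weights, and shared learning rate are assumptions, and the paper's inverse-RL estimation step is not shown.

```python
GAMMA = {"goal_seeking": 0.95, "obstacle_avoidance": 0.7}  # assumed per-module discounts
ALPHA = 0.1

def modular_update(Qs, module, s, a, r_module, s_next, actions):
    """Per-module Q-learning with a module-specific discount factor."""
    Q = Qs[module]
    q_sa = Q.get((s, a), 0.0)
    target = r_module + GAMMA[module] * max(Q.get((s_next, b), 0.0) for b in actions)
    Q[(s, a)] = q_sa + ALPHA * (target - q_sa)

def composite_value(Qs, s, a):
    """Action value used for choice: the sum of the modules' values."""
    return sum(Q.get((s, a), 0.0) for Q in Qs.values())
```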
- Research Article
34
- 10.1523/jneurosci.6421-10.2011
- Apr 27, 2011
- The Journal of Neuroscience
Reinforcement learning theory has generated substantial interest in neurobiology, particularly because of the resemblance between phasic dopamine and reward prediction errors. Actor-critic theories have been adapted to account for the functions of the striatum, with parts of the dorsal striatum equated to the actor. Here, we specifically test whether the human dorsal striatum, as predicted by an actor-critic instantiation, is used on a trial-to-trial basis at the time of choice to choose in accordance with reinforcement learning theory, as opposed to a competing strategy: the gambler's fallacy. Using a partial-brain functional magnetic resonance imaging scanning protocol focused on the striatum and other ventral brain areas, we found that the dorsal striatum is more active when choosing consistent with reinforcement learning compared with the competing strategy. Moreover, an overlapping area of dorsal striatum, along with the ventral striatum, was found to be correlated with reward prediction errors at the time of outcome, as predicted by the actor-critic framework. These findings suggest that the same region of dorsal striatum involved in learning stimulus-response associations may contribute to the control of behavior during choice, thereby using those learned associations. Intriguingly, neither reinforcement learning nor the gambler's fallacy conformed to the optimal choice strategy on the specific decision-making task we used. Thus, the dorsal striatum may contribute to the control of behavior according to reinforcement learning even when the prescriptions of such an algorithm are suboptimal in terms of maximizing future rewards.
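The actor-critic division of labor the paper tests, one prediction error training both state values (critic) and action preferences (actor), is the textbook scheme sketched below; the striatal mapping is the paper's hypothesis, while the code is a generic illustration with assumed parameters.

```python
import math
import random

ALPHA_V, ALPHA_P, GAMMA = 0.1, 0.1, 0.9  # assumed learning rates and discount

def softmax_choice(prefs, s, actions):
    """Sample an action from a softmax over the actor's preferences."""
    exps = [math.exp(prefs.get((s, a), 0.0)) for a in actions]
    total = sum(exps)
    r, acc = random.random() * total, 0.0
    for a, e in zip(actions, exps):
        acc += e
        if r <= acc:
            return a
    return actions[-1]

def actor_critic_update(V, prefs, s, a, reward, s_next):
    """One TD error drives both the critic and the actor."""
    delta = reward + GAMMA * V.get(s_next, 0.0) - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + ALPHA_V * delta                     # critic: state values
    prefs[(s, a)] = prefs.get((s, a), 0.0) + ALPHA_P * delta   # actor: preferences
    return delta
```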
- Research Article
8
- 10.1523/jneurosci.0752-22.2022
- Jan 20, 2023
- The Journal of Neuroscience
In reinforcement learning (RL), animals choose by assigning values to options and learn by updating these values from reward outcomes. This framework has been instrumental in identifying fundamental learning variables and their neuronal implementations. However, canonical RL models do not explain how reward values are constructed from biologically critical intrinsic reward components, such as nutrients. From an ecological perspective, animals should adapt their foraging choices in dynamic environments to acquire nutrients that are essential for survival. Here, to advance the biological and ecological validity of RL models, we investigated how (male) monkeys adapt their choices to obtain preferred nutrient rewards under varying reward probabilities. We found that the nutrient composition of rewards strongly influenced learning and choices. The animals' preferences for specific nutrients (sugar, fat) affected how they adapted to changing reward probabilities; the history of recent rewards influenced the monkeys' choices more strongly if these rewards contained their preferred nutrients (nutrient-specific reward history). The monkeys also chose preferred nutrients even when they were associated with lower reward probability. A nutrient-sensitive RL model captured these processes; it updated the values of the individual sugar and fat components of expected rewards based on experience and integrated them into subjective values that explained the monkeys' choices. Nutrient-specific reward prediction errors guided this value-updating process. Our results identify nutrients as important reward components that guide learning and choice by influencing the subjective value of choice options. Extending RL models with nutrient-value functions may enhance their biological validity and uncover nutrient-specific learning and decision variables.

SIGNIFICANCE STATEMENT: RL is an influential framework that formalizes how animals learn from experienced rewards. Although reward is a foundational concept in RL theory, canonical RL models cannot explain how learning depends on specific reward properties, such as nutrients. Intuitively, learning should be sensitive to the nutrient components of the reward to benefit health and survival. Here, we show that the nutrient (fat, sugar) composition of rewards affects how monkeys choose and learn in an RL paradigm, and that key learning variables, including reward history and reward prediction error, should be modified with nutrient-specific components to account for the observed choice behavior. By incorporating biologically critical nutrient rewards into the RL framework, our findings help advance the ecological validity of RL models.
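The verbal model in the abstract (per-nutrient values, nutrient-specific prediction errors, integration into one subjective value) can be sketched as below; the linear integration and all parameter values are assumptions.

```python
ALPHA = 0.2
PREFERENCE = {"sugar": 0.7, "fat": 0.3}  # assumed nutrient weights

def nutrient_update(values, option, outcome):
    """values[option]: nutrient -> learned expected amount.
    outcome: nutrient -> experienced amount (0.0 on unrewarded trials)."""
    for nutrient in PREFERENCE:
        rpe = outcome.get(nutrient, 0.0) - values[option][nutrient]  # nutrient-specific RPE
        values[option][nutrient] += ALPHA * rpe

def subjective_value(values, option):
    """Integrate nutrient components into one value guiding choice."""
    return sum(PREFERENCE[n] * values[option][n] for n in PREFERENCE)

# Example: learn about option "A" after a sugary reward.
values = {"A": {"sugar": 0.0, "fat": 0.0}}
nutrient_update(values, "A", {"sugar": 1.0, "fat": 0.2})
print(subjective_value(values, "A"))
```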
- Research Article
- 10.29173/aar49
- Sep 10, 2019
- Alberta Academic Review
Artificial agents have often been compared to humans in their ability to categorize images or play strategic games. However, comparisons between human and artificial agents are frequently based on overall performance on a particular task, and not necessarily on the specifics of how each agent behaves. In this study, we directly compared human behaviour with a reinforcement learning (RL) model. Human participants and an RL agent navigated through different grid world environments with high- and low-value targets. The artificial agent consisted of a deep neural network trained to map pixel input of a 27x27 grid world into cardinal directions using RL. An epsilon-greedy policy was used to maximize reward. Behaviour of both agents was evaluated on four different conditions. Results showed that both humans and RL agents consistently chose the higher reward over a lower reward, demonstrating an understanding of the task. Though both humans and RL agents weigh movement cost against reward, the machine agent weighs movement costs more heavily, trading off effort against reward differently than humans do. We found that humans and RL agents both consider long-term rewards as they navigate through the world, yet unlike humans, the RL model completely disregards limitations on movements (e.g., the total number of moves allowed). Finally, we rotated pseudorandom grid arrangements to study how decisions change with visual differences. We unexpectedly found that the RL agent changed its behaviour due to visual rotations, yet remained less variable than humans. Overall, the similarities between humans and the RL agent show the potential of RL agents to serve as an adequate model of human behaviour. Additionally, the differences between human and RL agents suggest improvements to RL methods that may increase their performance. This research compares the human mind with artificial intelligence, creating opportunities for future innovation.
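The epsilon-greedy choice rule the RL agent used can be isolated in a few lines; the Q-value source (a deep network over the 27x27 pixel grid in the study) is abstracted into a plain dictionary here, and the example values are invented.

```python
import random

ACTIONS = ["north", "south", "east", "west"]

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon act randomly; otherwise take the greedy action."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q_values[a])

# Illustrative values netting target reward against a per-step movement cost.
step_cost = 0.05
q = {"north": 1.0 - step_cost, "south": 0.2, "east": 0.5, "west": 0.1}
print(epsilon_greedy(q))
```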
- Research Article
187
- 10.1371/journal.pcbi.1003024
- Apr 11, 2013
- PLoS Computational Biology
Animals repeat rewarded behaviors, but the physiological basis of reward-based learning has only been partially elucidated. On one hand, experimental evidence shows that the neuromodulator dopamine carries information about rewards and affects synaptic plasticity. On the other hand, the theory of reinforcement learning provides a framework for reward-based learning. Recent models of reward-modulated spike-timing-dependent plasticity have taken first steps toward bridging the gap between the two approaches, but face two problems. First, reinforcement learning is typically formulated in a discrete framework, ill-adapted to the description of natural situations. Second, biologically plausible models of reward-modulated spike-timing-dependent plasticity require precise calculation of the reward prediction error, yet it remains to be shown how this can be computed by neurons. Here we propose a solution to these problems by extending the continuous temporal difference (TD) learning of Doya (2000) to the case of spiking neurons in an actor-critic network operating in continuous time, and with continuous state and action representations. In our model, the critic learns to predict expected future rewards in real time. Its activity, together with actual rewards, conditions the delivery of a neuromodulatory TD signal to itself and to the actor, which is responsible for action choice. In simulations, we show that such an architecture can solve a Morris water-maze-like navigation task, in a number of trials consistent with reported animal performance. We also use our model to solve the acrobot and the cartpole problems, two complex motor control tasks. Our model provides a plausible way of computing reward prediction error in the brain. Moreover, the analytically derived learning rule is consistent with experimental evidence for dopamine-modulated spike-timing-dependent plasticity.
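A discretized reading of the continuous TD error of Doya (2000) that the model extends, delta(t) = r(t) - V(t)/tau + dV/dt, is sketched below; the time step, time constant, and learning rate are illustrative, and the spiking actor-critic machinery is not shown.

```python
TAU, DT, ALPHA = 1.0, 0.01, 0.05  # assumed time constant, step, learning rate

def continuous_td_error(r, v, v_next):
    """delta(t) = r(t) - V(t)/tau + dV/dt, with dV/dt by finite difference."""
    dv_dt = (v_next - v) / DT
    return r - v / TAU + dv_dt

def critic_step(V, s, s_next, r):
    """Critic update driven by the continuous-time prediction error."""
    delta = continuous_td_error(r, V.get(s, 0.0), V.get(s_next, 0.0))
    V[s] = V.get(s, 0.0) + ALPHA * delta * DT
    return delta
```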
- Book Chapter
- 10.1007/978-3-030-60577-3_17
- Oct 2, 2020
The research of W. Schultz in the late 1980s and early 1990s on how uncertainty in reward delivery affects dopamine release by dopaminergic midbrain structures, in behavioral experiments with monkeys [1, 2], highlighted the analogy between the amount of phasic dopamine release and the reward prediction error of RL theory [3]. Since then, the functioning of the cortex-basal ganglia system has been analysed as a possible reinforcement learning (RL) network [4]. This system is an array of partly connected parallel loops. The basal ganglia are divided into dorsal and ventral subdivisions. In accordance with their functions, we can further distinguish four parts: the dorsolateral striatum, the dorsomedial striatum, the nucleus accumbens core, and the nucleus accumbens medial shell. The part of the whole cerebral cortex-basal ganglia system centered on the dorsolateral striatum may represent the action a used in RL theory; the part centered on the dorsomedial striatum may represent the action value Q(s,a); the part centered on the nucleus accumbens core may contain the state value V(s); and the part based on the nucleus accumbens medial shell may calculate the policy π, though in a different way than RL theory does.
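The RL quantities the chapter distributes over these striatal territories relate to each other as sketched below; the softmax form of the policy and the inverse temperature are standard textbook assumptions, not claims from the chapter.

```python
import math

BETA = 2.0  # assumed inverse temperature

def policy(Q, s, actions, beta=BETA):
    """pi(a|s) as a softmax over action values Q(s, a)."""
    exps = {a: math.exp(beta * Q.get((s, a), 0.0)) for a in actions}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}

def state_value(Q, s, actions):
    """V(s) as the policy-weighted average of Q(s, a)."""
    pi = policy(Q, s, actions)
    return sum(pi[a] * Q.get((s, a), 0.0) for a in actions)
```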
- Research Article
- 10.1287/moor.2022.0216
- Nov 6, 2024
- Mathematics of Operations Research
We study offline reinforcement learning (RL), which aims to learn an optimal policy from a data set collected a priori. Because of the lack of further interactions with the environment, offline RL suffers from insufficient coverage of the data set, which eludes most existing theoretical analyses. In this paper, we propose a pessimistic variant of the value iteration algorithm (PEVI), which incorporates an uncertainty quantifier as a penalty function. Such a penalty function simply flips the sign of the bonus function used for promoting exploration in online RL, which makes it easily implementable and compatible with general function approximators. Without assuming sufficient coverage of the data set (e.g., finite concentrability coefficients or uniformly lower-bounded densities of visitation measures), we establish a data-dependent upper bound on the suboptimality of PEVI for general Markov decision processes (MDPs). When specialized to linear MDPs, it matches the information-theoretic lower bound up to multiplicative factors of the dimension and horizon. In other words, pessimism is not only provably efficient but also minimax optimal. In particular, given the data set, the learned policy serves as the "best effort" among all policies, as no other policy can do better. Our theoretical analysis identifies the critical role of pessimism in eliminating a notion of spurious correlation, which arises from the "irrelevant" trajectories that are less covered by the data set and not informative for the optimal policy. Funding: Z. Yang acknowledges the Simons Institute (Theory of Reinforcement Learning). Z. Wang acknowledges the National Science Foundation [Awards 2048075, 2008827, 2015568, and 1934931], the Simons Institute (Theory of Reinforcement Learning), Amazon, J.P. Morgan, and Two Sigma.
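The pessimism principle is easy to sketch: estimate Q from the offline data, then subtract an uncertainty penalty that grows where coverage is thin. The count-based penalty below is a simple stand-in for the paper's general uncertainty quantifier, and the names and scale are assumptions.

```python
BONUS = 1.0  # assumed penalty scale

def pessimistic_q(q_hat, counts, s, a):
    """Penalized value: an exploration-style bonus with its sign flipped."""
    penalty = BONUS / (counts.get((s, a), 0) + 1) ** 0.5
    return q_hat.get((s, a), 0.0) - penalty

def greedy_pessimistic_policy(q_hat, counts, s, actions):
    """Act on penalized values, steering away from poorly covered pairs."""
    return max(actions, key=lambda a: pessimistic_q(q_hat, counts, s, a))
```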