A neural network model combining the successor representation and actor-critic methods reveals effective biological use of the representation

Abstract

In learning goal-directed behavior, state representation is important for adapting to the environment and achieving goals. A predictive state representation called the successor representation (SR) has recently attracted attention as a candidate for state representation in animal brains, especially in the hippocampus. The relationship between the SR and the animal brain has been studied, and several neural network models for computing the SR have been proposed based on these findings. However, studies on implementations of the SR that involve action selection have not yet advanced significantly. We therefore explore possible mechanisms by which the SR could be used biologically for action selection and for learning optimal action policies. The actor-critic architecture is a promising model of animal behavioral learning in terms of its correspondence to the anatomy and function of the basal ganglia, making it well suited to this purpose. In this study, we construct neural network models for behavioral learning that use the SR and investigate their properties through reinforcement learning experiments. Specifically, we examined the effect of using different state representations for the actor and the critic in the actor-critic method, and compared the actor-critic method with Q-learning and SARSA. We found that using the SR for the actor and using it for the critic have different effects, and observed that combining the SR with one-hot encoding makes it possible to learn with the benefits of both representations. These results suggest that the striatum may learn using multiple state representations complementarily.
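The combination described in the abstract (an SR-based critic alongside a one-hot-based actor) can be sketched in miniature. The toy example below is an illustration under invented assumptions, not the authors' actual model: a five-state chain task, a successor representation learned by TD and used as critic features, and an actor with softmax preferences over one-hot state features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy chain MDP: states 0..4, actions {0: left, 1: right}; entering
# state 4 ends the episode with reward 1.
n_states, n_actions, gamma, alpha = 5, 2, 0.9, 0.1

def step(s, a):
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == n_states - 1 else 0.0), s2 == n_states - 1

M = np.eye(n_states)                     # successor representation (critic features)
w = np.zeros(n_states)                   # reward weights: V(s) = M[s] @ w
theta = np.zeros((n_states, n_actions))  # actor parameters on one-hot features

for ep in range(500):
    s, done = 0, False
    while not done:
        probs = np.exp(theta[s] - theta[s].max())
        probs /= probs.sum()
        a = rng.choice(n_actions, p=probs)
        s2, r, done = step(s, a)
        # TD update of the SR (the terminal state stays absorbing: M[4] = e_4)
        M[s] += alpha * (np.eye(n_states)[s] + gamma * M[s2] - M[s])
        w[s2] += alpha * (r - w[s2])     # learn reward weights per visited state
        delta = r + gamma * (0.0 if done else M[s2] @ w) - M[s] @ w
        grad = -probs
        grad[a] += 1.0                   # gradient of log softmax policy
        theta[s] += alpha * delta * grad
        s = s2

print(np.argmax(theta[:n_states - 1], axis=1))
```

After training, the actor should prefer moving right (toward the reward) in every non-terminal state; the point of the sketch is only that the critic's TD error, computed from SR features, can drive a one-hot actor, echoing the paper's theme of mixing representations.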

Similar Papers
  • Book Chapter
  • Cited by 1
  • 10.1007/978-3-030-29551-6_43
State Representation Learning for Minimax Deep Deterministic Policy Gradient
  • Jan 1, 2019
  • Dapeng Hu + 3 more

Recently, multi-agent reinforcement learning has developed rapidly, notably the Minimax Deep Deterministic Policy Gradient (M3DDPG) algorithm, which improves agent robustness and addresses the problem that agents trained by deep reinforcement learning (DRL) are often vulnerable and sensitive to the training environment. However, agents in real environments may be unable to perceive certain important characteristics of the environment because of their limited perceptual capabilities, and so often fail to achieve the desired results. In this paper, we propose a novel algorithm, State Representation Learning for Minimax Deep Deterministic Policy Gradient (SRL_M3DDPG), that combines M3DDPG with a state representation learning neural network model to extract the important characteristics of raw data. We optimize the actor and critic networks using the state representation learning model, so that they learn from the state representation instead of the raw observations. Simulation experiments show that the algorithm improves the final result.

  • Research Article
  • 10.6100/ir709305
Learning models in interdependence situations
  • Nov 18, 2015
  • Willem Horst + 1 more

Many approaches to learning in games fall into one of two broad classes: reinforcement and belief learning models. Reinforcement learning assumes that successful past actions have a higher probability of being played in the future. Belief learning assumes that players hold beliefs about which action the opponent(s) will choose and determine their own action by finding the one with the highest payoff given those beliefs. Belief learning and (a specific type of) reinforcement learning are special cases of a hybrid learning model called Experience Weighted Attraction (EWA). Some previous studies explicitly state that it is difficult to determine the underlying process (reinforcement learning, belief learning, or something else) that generated the data for several games. This leads to the main question of this thesis: can we distinguish between different types of EWA-based learning, with reinforcement and belief learning as special cases, in repeated 2 x 2 games? In Chapter 2 we derive predictions for behavior in three types of games from the EWA learning model, using the concept of stability: there is a large probability that all players will make the same choice in round t+1 as in round t. From this we conclude that belief and reinforcement learning can be distinguished, even in 2 x 2 games. Maximum differentiation in behavior resulting from either belief or reinforcement learning is obtained in games with pure Nash equilibria with negative payoffs and at least one other strategy combination with only positive payoffs. Our results help researchers identify games in which belief and reinforcement learning can be discerned easily. Our theoretical results imply that the learning models can be distinguished after a sufficient number of rounds has been played, but it is not clear how large that number needs to be, nor how likely it is that stability actually occurs in game play. 
To that end, we also examine the main question by simulating data from learning models in Chapter 3. We use the same three types of 2 x 2 games as before and investigate whether we can discern between reinforcement and belief learning in an experimental setup. We conclude that this is also possible, especially in games with positive payoffs and in the repeated Prisoner’s Dilemma game, even when the repeated game has a relatively small number of rounds. We also show that other characteristics of the players’ behavior, such as the number of times a player changes strategy and the number of strategy combinations the player uses, can help differentiate between the two learning models. So far, we only considered "pure" belief and "pure" reinforcement learning, and nothing in between. In Chapter 4 we therefore consider a broader class of learning models and try to find the conditions under which we can re-estimate three parameters of the EWA learning model from simulated data generated for different games and scenarios. The results show low rates of convergence of the estimation algorithm, and even when the algorithm converges, biased estimates of the parameters are obtained most of the time. Hence, we must conclude that re-estimating the exact parameters in a quantitative manner is difficult in most experimental setups. Qualitatively, however, we can find patterns that point in the direction of either belief or reinforcement learning. Finally, in the last chapter, we study the effect of a player’s social preferences on his own payoff in 2 x 2 games with only a mixed-strategy equilibrium, under the assumption that the other player has no social preferences. We model social preferences with the Fehr-Schmidt inequity aversion model, which contains parameters for "envy" and "spite". Eighteen different mixed-equilibrium games are identified that can be classified into Regret games, Risk games, and RiskRegret games, with six games in each class. 
The effects of envy and spite in these games are studied in five different status scenarios in which the player with social preferences receives much higher, mostly higher, about equal, mostly lower, or much lower payoffs. The theoretical and simulation results reveal that the effects of social preferences are variable across scenarios and games, even within scenario-game combinations. However, we can conclude that the effects of envy and spite are analogous, on average beneficial to the player with the social preferences, and most positive when the payoffs are about equal and in Risk games.
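Since the thesis treats belief and reinforcement learning as special cases of Experience Weighted Attraction, a compact sketch of the standard EWA attraction update may help fix ideas. Everything below (the Prisoner's Dilemma payoffs, the parameter values, the always-defecting opponent) is a hypothetical choice for illustration, not the thesis's experimental setup.

```python
import numpy as np

def ewa_update(A, N, my_action, payoffs, delta, phi, rho):
    """One EWA update for one player.

    A: attraction vector; N: experience weight; payoffs[i] is the payoff
    strategy i would have earned against the opponent's realized action.
    """
    N_new = rho * N + 1.0
    # played strategy reinforced by its payoff; others by delta * forgone payoff
    reinf = np.where(np.arange(len(A)) == my_action, 1.0, delta)
    A_new = (phi * N * A + reinf * payoffs) / N_new
    return A_new, N_new

def choice_probs(A, lam=1.0):
    e = np.exp(lam * (A - A.max()))      # logit choice rule
    return e / e.sum()

# Row player's Prisoner's Dilemma payoffs: (C,C)=3, (C,D)=0, (D,C)=5, (D,D)=1
pay = np.array([[3.0, 0.0], [5.0, 1.0]])
A, N = np.zeros(2), 1.0
# delta=0, rho=0 recovers (averaged) reinforcement learning;
# delta=1, rho=phi recovers belief learning (weighted fictitious play).
for t in range(50):
    opp = 1                              # suppose the opponent always defects
    me = int(np.argmax(choice_probs(A, lam=2.0)))
    A, N = ewa_update(A, N, me, pay[:, opp], delta=1.0, phi=0.9, rho=0.9)
print(choice_probs(A, lam=2.0))
```

With these belief-learning parameters, the defect attraction dominates against a defecting opponent, as fictitious play would predict.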

  • Research Article
  • 10.4233/uuid:f8faacb0-9a55-453d-97fd-0388a3c848ee
Sample efficient deep reinforcement learning for control
  • Dec 15, 2019
  • Tim De Bruin

The arrival of intelligent, general-purpose robots that can learn to perform new tasks autonomously has been promised for a long time now. Deep reinforcement learning, which combines reinforcement learning with deep neural network function approximation, has the potential to enable robots to learn to perform a wide range of new tasks while requiring very little prior knowledge or human help. This framework might therefore help to finally make general-purpose robots a reality. However, the biggest successes of deep reinforcement learning have so far been in simulated game settings. To translate these successes to the real world, significant improvements are needed in the ability of these methods to learn quickly and safely. This thesis investigates what is needed to make this possible and makes contributions towards this goal.

Before deep reinforcement learning methods can be successfully applied in the robotics domain, an understanding is needed of how, when, and why deep learning and reinforcement learning work well together. This thesis therefore starts with a literature review, which is presented in Chapter 2. While the field is still in some regards in its infancy, it can already be noted that there are important components shared by successful algorithms. These components help to reconcile the differences between classical reinforcement learning methods and the training procedures used to successfully train deep neural networks. The main challenges in combining deep learning with reinforcement learning center around the interdependencies of the policy, the training data, and the training targets. Commonly used tools for managing the detrimental effects caused by these interdependencies include target networks, trust region updates, and experience replay buffers. Besides reviewing these components, a number of the more popular and historically relevant deep reinforcement learning methods are discussed.

Reinforcement learning involves learning through trial and error. However, robots (and their surroundings) are fragile, which makes these trials, and especially errors, very costly. Therefore, the amount of exploration that is performed will often need to be drastically reduced over time, especially once a reasonable behavior has already been found. We demonstrate how, using common experience replay techniques, this can quickly lead to forgetting previously learned successful behaviors. This problem is investigated in Chapter 3. Experiments are conducted to investigate what distribution of the experiences over the state-action space leads to desirable learning behavior and what distributions can cause problems. It is shown how actor-critic algorithms are especially sensitive to the lack of diversity in the action space that can result from reducing the amount of exploration over time. Further relations between the properties of the control problem at hand and the required data distributions are also shown. These include a larger need for diversity in the action space when control frequencies are high and a reduced importance of data diversity for problems where generalizing the control strategy across the state space is more difficult.

While Chapter 3 investigates what data distributions are most beneficial, Chapter 4 instead proposes practical algorithms to select useful experiences from a stream of experiences. We do not assume to have any control over the stream of experiences, which makes it possible to learn from additional sources of experience like other robots, experiences obtained while learning different tasks, and experiences obtained using predefined controllers. We make two separate judgments on the utility of individual experiences. The first judgment is on the long-term utility of experiences, which is used to determine which experiences to keep in memory once the experience buffer is full. The second judgment is on the instantaneous utility of the experience to the learning agent, and is used to determine which experiences should be sampled from the buffer to be learned from. To estimate the short- and long-term utility of the experiences we propose proxies based on the age, surprise, and exploration intensity associated with the experiences. It is shown how prior knowledge of the control problem at hand can be used to decide which proxies to use. We additionally show how knowledge of the control problem can be used to estimate the optimal size of the experience buffer and whether or not to use importance sampling to compensate for the bias introduced by the selection procedure. Together, these choices can lead to a more stable learning procedure and better-performing controllers.

In Chapter 5 we look at what to learn from the collected data. The high price of data in the robotics domain makes it crucial to extract as much knowledge as possible from each and every datum. Reinforcement learning, by default, does not do so. We therefore supplement reinforcement learning with explicit state representation learning objectives. These objectives are based on the assumption that the neural network controller to be learned can be seen as consisting of two consecutive parts. The first part (referred to as the state encoder) maps the observed sensor data to a compact and concise representation of the state of the robot and its environment. The second part determines which actions to take based on this state representation. As the representation of the state of the world is useful for more than just completing the task at hand, it can also be trained with more general (state representation learning) objectives than just the reinforcement learning objective associated with the current task. We show how including these additional training objectives allows for learning a much more general state representation, which in turn makes it possible to learn broadly applicable control strategies more quickly. We also introduce a training method that ensures that the added learning objectives further the goal of reinforcement learning, without destabilizing the learning process through their changes to the state encoder.

The final contribution of this thesis, presented in Chapter 6, focuses on the optimization procedure used to train the second part of the policy: the mapping from the state representation to the actions. While we show that the state encoder can be efficiently trained with standard gradient-based optimization techniques, perfecting this second mapping is more difficult. Obtaining high-quality estimates of the gradients of the policy performance with respect to the parameters of this part of the neural network is usually not feasible. This means that while a reasonable policy can be obtained relatively quickly using gradient-based optimization approaches, this speed comes at the cost of the stability of the learning process as well as the final performance of the controller. Additionally, the unstable nature of this learning process brings with it an extreme sensitivity to the values of the hyper-parameters of the training method. This places an unfortunate emphasis on hyper-parameter tuning for getting deep reinforcement learning algorithms to work well. Gradient-free optimization algorithms can be simpler and more stable, but tend to be much less sample efficient. We show how the desirable aspects of both methods can be combined by first training the entire network through gradient-based optimization and subsequently fine-tuning the final part of the network in a gradient-free manner. We demonstrate how this enables the policy to improve in a stable manner to a performance level not obtained by gradient-based optimization alone, using far fewer trials than methods relying only on gradient-free optimization.
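The two-stage scheme in the final contribution, gradient-based training followed by gradient-free fine-tuning of the last part of the policy, can be illustrated on a toy problem. The sketch below uses plain hill climbing as a simple stand-in for the thesis's gradient-free optimizer; the linear "policy", the data, and the black-box score are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "policy": a linear map; performance is available both as a
# differentiable surrogate and as a black-box score.
target = np.array([2.0, -1.0])           # hidden optimal parameters
X = rng.normal(size=(64, 2))
y = X @ target

def score(w):                            # black-box performance measure
    return -np.mean((X @ w - y) ** 2)

# Stage 1: coarse gradient-based training on the differentiable surrogate.
w = np.zeros(2)
for _ in range(20):
    grad = 2 * X.T @ (X @ w - y) / len(X)
    w -= 0.05 * grad
coarse = score(w)

# Stage 2: gradient-free fine-tuning of the final mapping (hill climbing):
# accept a random perturbation only if it improves the black-box score.
best, best_s = w.copy(), coarse
for _ in range(200):
    cand = best + rng.normal(scale=0.05, size=2)
    s = score(cand)
    if s > best_s:
        best, best_s = cand, s

print(coarse, best_s)
```

Because candidates are only accepted on improvement, the fine-tuned score can never fall below the gradient-trained score, which is the stability property the chapter is after (the thesis's actual optimizer and network are, of course, far richer than this).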

  • Preprint Article
  • 10.31234/osf.io/s58jh_v1
A Reinforcement Learning and Decision-Making framework for understanding Mental Disorders
  • Apr 17, 2025
  • Jeung-Hyun Lee + 4 more

While mental disorders are complex and characterized by heterogeneous symptoms, a unified framework that fully and mechanistically captures these complexities remains elusive. Reinforcement learning offers a promising way to understand mental health by mathematically modeling the decision-making processes that underlie psychiatric conditions. By breaking decision-making down into key components—such as state representation, valuation, action selection, and outcome evaluation—reinforcement learning provides a structured approach to studying how disruptions in these processes contribute to disorders like depression, anxiety, and addiction. This review explores how reinforcement learning can help clarify the cognitive and neural mechanisms involved in mental disorders and offers insights into their interactions with other psychological and physiological systems. We also discuss the potential of the framework to improve clinical practice. Future directions will focus on extending reinforcement learning models to naturalistic paradigms and integrating them with advanced technologies such as artificial intelligence.

  • Research Article
  • 10.3389/conf.fncom.2011.52.00019
Connectionist model of action learning and naming
  • Jan 1, 2011
  • Frontiers in Computational Neuroscience
  • Farkas Igor


  • Research Article
  • Cited by 8
  • 10.1109/tcds.2020.3035778
Behavior Decision of Mobile Robot With a Neurophysiologically Motivated Reinforcement Learning Model
  • Nov 4, 2020
  • IEEE Transactions on Cognitive and Developmental Systems
  • Dongshu Wang + 4 more

Online model-free reinforcement learning (RL) approaches play a crucial role in real-world applications such as behavioral decision making in robotics. How to balance the exploration and exploitation processes is a central problem in RL: the exploration/exploitation ratio has a great influence on the total learning time and the quality of the learned strategy. Various action selection policies have therefore been presented to balance exploration and exploitation. However, these approaches are rarely regulated automatically and dynamically in response to environment variations. One of the most remarkable self-adaptation mechanisms in animals is their capacity to dynamically switch between exploration and exploitation strategies. This article proposes a novel neurophysiologically motivated model that simulates the roles of the medial prefrontal cortex (MPFC) and lateral prefrontal cortex (LPFC) in behavior decision. The sensory input is transmitted to the MPFC; the ventral tegmental area (VTA) receives a reward and calculates a dopaminergic reinforcement signal; and feedback categorization neurons in the anterior cingulate cortex (ACC) calculate a vigilance from the dopaminergic reinforcement signal. The vigilance is passed to the LPFC to regulate the exploration rate, and finally the exploration rate is transmitted to the thalamus to calculate the corresponding action probabilities. This action selection mechanism is introduced into the actor–critic model of the basal ganglia and combined with a cerebellum model based on the developmental network to construct a new hybrid neuromodulatory model for selecting the agent's actions. Both simulation comparisons with four traditional action selection policies and physical experiment results demonstrate the potential of the proposed neuromodulatory model in action selection.
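The core adaptive idea above, a dopaminergic reinforcement signal whose surprises raise a vigilance that in turn raises the exploration rate, can be caricatured in a few lines. This is not the paper's MPFC/LPFC model; it is a minimal sketch on an invented two-armed bandit whose better arm switches halfway through, showing exploration tracking the reward-prediction error.

```python
import numpy as np

rng = np.random.default_rng(4)

Q = np.zeros(2)                          # action values
eps, alpha = 0.1, 0.2                    # exploration rate, learning rate
eps_trace = []
for t in range(400):
    p_good = [0.9, 0.1] if t < 200 else [0.1, 0.9]   # best arm switches at t=200
    a = rng.integers(2) if rng.random() < eps else int(np.argmax(Q))
    r = float(rng.random() < p_good[a])
    delta = r - Q[a]                     # dopaminergic reinforcement signal (RPE)
    Q[a] += alpha * delta
    vigilance = abs(delta)               # surprise raises vigilance
    eps = 0.9 * eps + 0.1 * vigilance    # exploration rate tracks vigilance
    eps = float(np.clip(eps, 0.01, 0.5))
    eps_trace.append(eps)
```

After the environment switch, prediction errors grow, so vigilance (and hence the exploration rate) rises until the new best arm is found, then decays again: a crude stand-in for the exploration/exploitation switching the paper models neurally.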

  • Research Article
  • Cited by 31
  • 10.1007/s10458-021-09497-8
Playing Atari with few neurons
  • Apr 19, 2021
  • Autonomous Agents and Multi-Agent Systems
  • Giuseppe Cuccu + 2 more

We propose a new method for learning compact state representations and policies separately but simultaneously for policy approximation in vision-based applications such as Atari games. Approaches based on deep reinforcement learning typically map pixels directly to actions to enable end-to-end training. Internally, however, the deep neural network bears the responsibility both of extracting useful information and of making decisions based on it, two objectives that can be addressed independently. Separating the image processing from the action selection allows for a better understanding of each task individually, as well as potentially finding smaller policy representations, which is inherently interesting. Our approach learns state representations using a compact encoder based on two novel algorithms: (i) Increasing Dictionary Vector Quantization builds a dictionary of state representations that grows over time, allowing our method to address new observations as they appear in an open-ended online-learning context; and (ii) Direct Residuals Sparse Coding encodes observations as a function of the dictionary, aiming for the highest information inclusion by disregarding reconstruction error and maximizing code sparsity. As the dictionary size increases, however, the encoder produces increasingly larger inputs for the neural network; this issue is addressed with a new variant of the Exponential Natural Evolution Strategies algorithm which adapts the dimensionality of its probability distribution along the run. We test our system on a selection of Atari games using tiny neural networks of only 6 to 18 neurons (depending on each game’s controls). These are still capable of achieving results that are not much worse than, and occasionally superior to, the state of the art in direct policy search, which uses two orders of magnitude more neurons.
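A minimal sketch of the dictionary-growth idea behind Increasing Dictionary Vector Quantization: the dictionary gains an atom whenever an observation is poorly matched by the existing atoms. The distance threshold, the dense code, and the toy clustered data below are simplifications invented here, not the paper's algorithm (which uses sparse residual codes on pixels).

```python
import numpy as np

class GrowingDictionaryEncoder:
    """Dictionary that grows when an observation is poorly represented,
    loosely in the spirit of Increasing Dictionary Vector Quantization."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.atoms = []                  # stored observation prototypes

    def encode(self, obs):
        if not self.atoms:               # first observation seeds the dictionary
            self.atoms.append(obs.copy())
        dists = np.array([np.linalg.norm(obs - a) for a in self.atoms])
        if dists.min() > self.threshold: # nothing matches well: grow
            self.atoms.append(obs.copy())
            dists = np.append(dists, 0.0)
        code = np.exp(-dists)            # dense toy code over the atoms
        return code / code.sum()

rng = np.random.default_rng(2)
enc = GrowingDictionaryEncoder(threshold=1.0)
for _ in range(100):
    cluster = rng.integers(3)            # 3 well-separated observation clusters
    obs = 5.0 * cluster + rng.normal(scale=0.1, size=4)
    code = enc.encode(obs)
print(len(enc.atoms))
```

On this data the dictionary stabilizes at one atom per cluster, while the code length (the encoder's output dimensionality) grows with the dictionary, which is exactly the issue the paper's dimension-adaptive evolution strategy addresses.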

  • Research Article
  • 10.1176/appi.neuropsych.11020038
Mental Practice: A Psychotherapy to Improve Action-Selection in Obsessive-Compulsive Disorder
  • Jan 1, 2012
  • The Journal of Neuropsychiatry and Clinical Neurosciences
  • Sareh Zendehrouh + 2 more

Mental Practice: A Psychotherapy to Improve Action-Selection in Obsessive-Compulsive Disorder


  • Research Article
  • Cited by 30
  • 10.1371/journal.pcbi.1008317
Reward-predictive representations generalize across tasks in reinforcement learning.
  • Oct 15, 2020
  • PLOS Computational Biology
  • Lucas Lehnert + 2 more

In computer science, reinforcement learning is a powerful framework with which artificial agents can learn to maximize their performance for any given Markov decision process (MDP). Advances over the last decade, in combination with deep neural networks, have enjoyed performance advantages over humans in many difficult task settings. However, such frameworks perform far less favorably when evaluated in their ability to generalize or transfer representations across different tasks. Existing algorithms that facilitate transfer typically are limited to cases in which the transition function or the optimal policy is portable to new contexts, but achieving "deep transfer" characteristic of human behavior has been elusive. Such transfer typically requires discovery of abstractions that permit analogical reuse of previously learned representations to superficially distinct tasks. Here, we demonstrate that abstractions that minimize error in predictions of reward outcomes generalize across tasks with different transition and reward functions. Such reward-predictive representations compress the state space of a task into a lower dimensional representation by combining states that are equivalent in terms of both the transition and reward functions. Because only state equivalences are considered, the resulting state representation is not tied to the transition and reward functions themselves and thus generalizes across tasks with different reward and transition functions. These results contrast with those using abstractions that myopically maximize reward in any given MDP and motivate further experiments in humans and animals to investigate if neural and cognitive systems involved in state representation perform abstractions that facilitate such equivalence relations.
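The state-equivalence compression described above can be illustrated with a toy partition-refinement procedure: states are merged when they agree on rewards and on block-level transitions. This is a simplified deterministic sketch of the idea, not the authors' method (which learns such abstractions from data); the MDP below is invented.

```python
# Deterministic toy MDP with 6 states and 2 actions. States 0 and 1 (and
# likewise 2 and 3) earn the same rewards and move to equivalent successors,
# so a reward-predictive abstraction can merge them.
T = {0: [2, 4], 1: [3, 4], 2: [4, 5], 3: [4, 5], 4: [4, 4], 5: [5, 5]}
R = {0: [0, 0], 1: [0, 0], 2: [1, 0], 3: [1, 0], 4: [0, 0], 5: [1, 1]}

def reward_predictive_abstraction(T, R, n_actions=2):
    """Partition refinement: start from identical reward profiles, then keep
    splitting blocks until block-level transitions also agree."""
    states = list(T)
    labels = {s: tuple(R[s]) for s in states}
    while True:
        sig = {s: (labels[s],) + tuple(labels[T[s][a]] for a in range(n_actions))
               for s in states}
        if len(set(sig.values())) == len(set(labels.values())):
            return sig                   # stable: no block was split this round
        labels = sig

blocks = reward_predictive_abstraction(T, R)
print(len(set(blocks.values())))  # 4 abstract states instead of 6
```

Because merged states agree on both rewards and (abstract) transitions, any reward sequence predicted from the compressed model matches the original, which is the sense in which such representations can transfer across tasks built on the same equivalence structure.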

  • Research Article
  • Cited by 5
  • 10.1007/s11081-021-09687-z
Improving the efficiency of reinforcement learning for a spacecraft powered descent with Q-learning
  • Oct 4, 2021
  • Optimization and Engineering
  • Callum Wilson + 1 more

Reinforcement learning entails many intuitive and useful approaches to solving various problems. Its main premise is to learn how to complete tasks by interacting with the environment and observing which actions yield more reward. Methods from reinforcement learning have long been applied in aerospace and have recently seen renewed interest in space applications. Problems in spacecraft control can benefit from intelligent techniques when faced with significant uncertainties, as is common in space environments. Solving these control problems with reinforcement learning remains a challenge, partly due to long training times and to performance that is sensitive to hyperparameters requiring careful tuning. In this work we address both issues for a sample spacecraft control problem. To reduce training times compared to other approaches, we simplify the problem by discretising the action space and use a data-efficient algorithm to train the agent. Furthermore, we employ an automated approach to hyperparameter selection which optimises for a specified performance metric. Our approach is tested on a 3-DOF powered descent problem with uncertainties in the initial conditions. We run experiments with two problem formulations: a ‘shaped’ state representation to guide the agent, and a ‘raw’ state representation with unprocessed values of position, velocity, and mass. The results show that an agent can learn a near-optimal policy efficiently by appropriately defining the action space and state space. Using the raw state representation led to ‘reward hacking’ and poor performance, which highlights the importance of the problem and state-space formulation in successfully training reinforcement learning agents. In addition, we show that the optimal hyperparameters can vary significantly based on the choice of loss function. 
Using two sets of hyperparameters optimised for different loss functions, we demonstrate that in both cases the agent can find near-optimal policies with comparable performance to previously applied methods.
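The paper's general recipe, tabular Q-learning over a discretised action space, can be sketched on an invented one-dimensional "descent" toy (not the 3-DOF problem, its dynamics, or the paper's hyperparameters):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy 1-D "descent": altitude 0..9, discretised thrust actions {descend, hover}.
# Reaching altitude 0 ends the episode with reward +10; every step costs 1.
n_alt, n_actions, gamma, alpha, eps = 10, 2, 0.95, 0.2, 0.1
Q = np.zeros((n_alt, n_actions))

for ep in range(300):
    s = n_alt - 1                        # start at the top
    for t in range(50):
        # epsilon-greedy over the discrete action set
        a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
        s2 = max(s - 1, 0) if a == 0 else s      # 0: descend, 1: hover
        r = 10.0 if s2 == 0 else -1.0
        done = s2 == 0
        # standard Q-learning update
        target = r + (0.0 if done else gamma * Q[s2].max())
        Q[s, a] += alpha * (target - Q[s, a])
        s = s2
        if done:
            break

print(np.argmax(Q[1:], axis=1))
```

The greedy policy learns to descend at every altitude. The discretisation is what keeps the value table small and training fast, mirroring the paper's motivation for a discrete action space.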

  • Book Chapter
  • Cited by 2
  • 10.1007/978-3-030-64580-9_26
State Representation Learning from Demonstration
  • Jan 1, 2020
  • Astrid Merckling + 4 more

Robots could learn their own state and world representation from perception and experience without supervision. This desirable goal is the main focus of our field of interest, state representation learning (SRL). Indeed, a compact representation of such a state is beneficial to help robots grasp onto their environment for interacting. The properties of this representation have a strong impact on the adaptive capability of the agent. In this article we present an approach based on imitation learning. The idea is to train several policies that share the same representation to reproduce various demonstrations. To do so, we use a multi-head neural network with a shared state representation feeding a task-specific agent. If the demonstrations are diverse, the trained representation will eventually contain the information necessary for all tasks, while discarding irrelevant information. As such, it will potentially become a compact state representation useful for new tasks. We call this approach SRLfD (State Representation Learning from Demonstration). Our experiments confirm that when a controller takes SRLfD-based representations as input, it can achieve better performance than with other representation strategies and promote more efficient reinforcement learning (RL) than with an end-to-end RL strategy.
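The shared-representation idea can be sketched with a multi-head linear model: one encoder feeds two task-specific heads, and gradients from both imitation tasks shape the shared state representation. Everything below (data, dimensions, learning rate) is an invented illustration, not the paper's network.

```python
import numpy as np

rng = np.random.default_rng(5)

# Observations are 6-D projections of a 2-D latent state; two "demonstrated"
# tasks each depend on that latent. A shared encoder feeding per-task heads
# is trained by regression on both tasks at once.
A = rng.normal(size=(6, 2))              # latent -> observation map (unknown)
Z = rng.normal(size=(200, 2))            # latent states
X = Z @ A.T                              # observations
Y1 = Z @ np.array([1.0, 0.5])            # task 1 demonstrations
Y2 = Z @ np.array([-0.3, 1.0])           # task 2 demonstrations

W_enc = rng.normal(scale=0.1, size=(6, 2))   # shared state encoder
w1, w2 = np.zeros(2), np.zeros(2)            # task-specific heads
lr = 0.01

def losses():
    S = X @ W_enc
    return np.mean((S @ w1 - Y1) ** 2), np.mean((S @ w2 - Y2) ** 2)

first = losses()
for _ in range(500):
    S = X @ W_enc
    e1, e2 = S @ w1 - Y1, S @ w2 - Y2
    w1 -= lr * 2 * S.T @ e1 / len(X)
    w2 -= lr * 2 * S.T @ e2 / len(X)
    # gradients from BOTH heads flow into the shared encoder
    W_enc -= lr * 2 * (np.outer(X.T @ e1, w1) + np.outer(X.T @ e2, w2)) / len(X)
final = losses()
print(first, final)
```

Because both heads' errors backpropagate into the same encoder, the learned 2-D representation must retain the latent information needed by every task, which is the mechanism SRLfD relies on to produce a reusable state representation.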

  • Research Article
  • Cited by 128
  • 10.1016/j.conb.2011.04.001
Multiple representations and algorithms for reinforcement learning in the cortico-basal ganglia circuit
  • Apr 29, 2011
  • Current Opinion in Neurobiology
  • Makoto Ito + 1 more


  • Book Chapter
  • Cited by 7
  • 10.1007/978-3-030-46133-1_39
Manufacturing Dispatching Using Reinforcement and Transfer Learning
  • Jan 1, 2020
  • Shuai Zheng + 2 more

Efficient dispatching rules in the manufacturing industry are key to ensuring on-time product delivery and minimizing past-due and inventory costs. Manufacturing, especially in the developed world, is moving towards on-demand production, meaning a high-mix, low-volume product mix. This requires efficient dispatching that can work in dynamic and stochastic environments, allowing quick response to newly received orders and working across a disparate set of shop-floor settings. In this paper we address this dispatching problem. Using reinforcement learning (RL), we propose a new design that formulates the shop-floor state as a 2-D matrix, incorporates job slack time into the state representation, and designs lateness and tardiness reward functions for dispatching. However, maintaining a separate RL model for each production line on a manufacturing shop floor is costly and often infeasible. To address this, we enhance our deep RL model with an approach for dispatching policy transfer, which increases policy generalization and saves time and cost for model training and data collection. Experiments show that (1) our approach performs best in terms of total discounted reward and average lateness and tardiness, and (2) the proposed policy transfer approach reduces training time and increases policy generalization.

  • Research Article
  • Cited by 1
  • 10.1007/bf03037599
Cooperation of categorical and behavioral learning in a practical solution to the abstraction problem
  • Sep 1, 2001
  • New Generation Computing
  • Atsushi Ueno + 1 more

Real robots should be able to adapt autonomously to various environments in order to keep executing their tasks without breaking down. They achieve this by learning how to abstract only the useful information from the huge amount of information in the environment while executing their tasks. This paper proposes a new architecture which performs categorical learning and behavioral learning in parallel with task execution. We call this architecture the Situation Transition Network System (STNS). In categorical learning, it builds a flexible state representation and modifies it according to the results of behaviors. Behavioral learning is reinforcement learning on that state representation. Simulation results have shown that this architecture can learn efficiently and adapt autonomously to unexpected changes in the environment.
