Improving Multiagent Actor-Critic Architectures, with Opponent Approximation and Dropout for Control

Abstract

In the domain of reinforcement learning, proposed solutions to multiagent problems are evolving. We propose a new algorithm, MADDPGX, which handles the increased uncertainty created by other agents' actions by means of an opponent actor approximator, and we investigate which estimation techniques are most efficient. The approximator is a neural network that takes the state as input and outputs the action most likely to be preferred by the opposing agent. We also experimented with dropout, a tool commonly used for neural networks that had not previously been used effectively in reinforcement learning, and found that it can improve overall performance in multiagent actor-critic scenarios. In summary, our contributions are the approximation of adversaries' actions and the use of dropout in actor-critic systems, with the conclusion that the proposed methods perform better in zero-sum multiagent robot system scenarios. The experiments were conducted in a multiagent predator-prey environment.
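
The opponent approximator described in the abstract can be sketched as a small network that maps the observed state to a guess of the opponent's action, with dropout active only during training. The sketch below is illustrative: the layer sizes, dropout rate, and class name are our assumptions, not details taken from the paper.

```python
import numpy as np

class OpponentApproximator:
    """Illustrative sketch of an opponent-action approximator: a small MLP
    that maps the observed state to the action the opposing agent is most
    likely to take. Dropout is applied to the hidden layer during training."""

    def __init__(self, state_dim, action_dim, hidden=64, p_drop=0.2, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (state_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0.0, 0.1, (hidden, action_dim))
        self.b2 = np.zeros(action_dim)
        self.p_drop = p_drop
        self.rng = rng

    def forward(self, state, training=False):
        h = np.maximum(0.0, state @ self.W1 + self.b1)       # ReLU hidden layer
        if training:
            # Inverted dropout: zero units with prob p_drop, rescale the rest.
            mask = self.rng.random(h.shape) >= self.p_drop
            h = h * mask / (1.0 - self.p_drop)
        return np.tanh(h @ self.W2 + self.b2)                # bounded action guess

approx = OpponentApproximator(state_dim=8, action_dim=2)
a_eval = approx.forward(np.ones(8))                   # deterministic at evaluation
a_train = approx.forward(np.ones(8), training=True)   # stochastic under dropout
```

Inverted dropout rescales the surviving units so that no correction is needed at evaluation time, which is the standard way dropout is combined with a deterministic action output.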

Similar Papers
  • Research Article
  • 10.6100/ir709305
Learning models in interdependence situations
  • Nov 18, 2015
  • Willem Horst + 1 more

Many approaches to learning in games fall into one of two broad classes: reinforcement and belief learning models. Reinforcement learning assumes that successful past actions have a higher probability of being played in the future. Belief learning assumes that players have beliefs about which action the opponent(s) will choose and that players determine their own choice of action by finding the action with the highest payoff given the beliefs about the actions of others. Belief learning and (a specific type of) reinforcement learning are special cases of a hybrid learning model called Experience Weighted Attraction (EWA). Some previous studies explicitly state that it is difficult to determine the underlying process (either reinforcement learning, belief learning, or something else) that generated the data for several games. This leads to the main question of this thesis: Can we distinguish between different types of EWA-based learning, with reinforcement and belief learning as special cases, in repeated 2 x 2 games? In Chapter 2 we derive predictions for behavior in three types of games with the EWA learning model, using the concept of stability: there is a large probability that all players will make the same choice in round t + 1 as in round t. From this, we conclude that belief and reinforcement learning can be distinguished, even in 2 x 2 games. Maximum differentiation in behavior resulting from either belief or reinforcement learning is obtained in games with pure Nash equilibria with negative payoffs and at least one other strategy combination with only positive payoffs. Our results help researchers to identify games in which belief and reinforcement learning can be discerned easily. Our theoretical results imply that the learning models can be distinguished after a sufficient number of rounds have been played, but it is not clear how large that number needs to be. It is also not clear how likely it is that stability actually occurs in game play.
To that end, we also examine the main question by simulating data from learning models in Chapter 3. We use the same three types of 2 x 2 games as before and investigate whether we can discern between reinforcement and belief learning in an experimental setup. Our conclusion is that this is also possible, especially in games with positive payoffs and in the repeated Prisoner’s Dilemma game, even when the repeated game has a relatively small number of rounds. We also show that other characteristics of the players’ behavior, such as the number of times a player changes strategy and the number of strategy combinations the player uses, can help differentiate between the two learning models. So far, we only considered "pure" belief and "pure" reinforcement learning, and nothing in between. For Chapter 4, we therefore consider a broader class of learning models and try to find under which conditions we can re-estimate three parameters of the EWA learning model from simulated data, generated for different games and scenarios. The results show low rates of convergence of the estimation algorithm, and even when the algorithm converges, biased estimates of the parameters are obtained most of the time. Hence, we must conclude that re-estimating the exact parameters in a quantitative manner is difficult in most experimental setups. However, qualitatively we can find patterns that point in the direction of either belief or reinforcement learning. Finally, in the last chapter, we study the effect of a player’s social preferences on his own payoff in 2 x 2 games with only a mixed strategy equilibrium, under the assumption that the other player has no social preferences. We model social preferences with the Fehr-Schmidt inequity aversion model, which contains parameters for "envy" and "spite". Eighteen different mixed equilibrium games are identified that can be classified into Regret games, Risk games, and RiskRegret games, with six games in each class.
The effects of envy and spite in these games are studied in five different status scenarios in which the player with social preferences receives much higher, mostly higher, about equal, mostly lower, or much lower payoffs. The theoretical and simulation results reveal that the effects of social preferences are variable across scenarios and games, even within scenario-game combinations. However, we can conclude that the effects of envy and spite are analogous, on average beneficial to the player with the social preferences, and most positive when the payoffs are about equal and in Risk games.
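
The hybrid EWA model referred to above has a standard one-round update (Camerer and Ho's formulation) in which a parameter delta interpolates between reinforcement learning (delta = 0: only the chosen strategy is reinforced) and belief learning (delta = 1: every strategy is reinforced by its foregone payoff). A minimal sketch, with illustrative parameter values:

```python
import numpy as np

def ewa_update(A, N, payoffs, chosen, phi=0.9, rho=0.9, delta=0.5):
    """One round of the Experience-Weighted Attraction (EWA) update.
    `A` holds the attractions of the player's strategies, `payoffs[j]` is the
    payoff strategy j would have earned against the opponent's realized action,
    and `chosen` is the index of the strategy actually played. delta=0 recovers
    reinforcement learning, delta=1 recovers belief learning, as noted above."""
    N_new = rho * N + 1.0
    # Chosen strategy gets full weight; unchosen ones only their delta share.
    weight = delta + (1.0 - delta) * (np.arange(len(A)) == chosen)
    A_new = (phi * N * A + weight * payoffs) / N_new
    return A_new, N_new

# Prisoner's-dilemma-style round: two strategies, the second one was played
# and earned payoff 1, while the unchosen first one would have earned 0.
A, N = np.zeros(2), 1.0
A, N = ewa_update(A, N, payoffs=np.array([0.0, 1.0]), chosen=1, delta=0.0)
```

Running the update with delta=0 reinforces only the strategy actually played, which is exactly the reinforcement-learning special case the thesis contrasts with belief learning.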

  • Conference Article
  • Cited by 4
  • 10.1109/aeeca49918.2020.9213574
A Spectrum Handoff Method Based on Reinforcement and Transfer Learning
  • Aug 1, 2020
  • Jiaxing Zhao + 3 more

This paper designs a spectrum handoff method based on reinforcement and transfer learning in a cognitive radio environment. In the context of secondary users adopting reinforcement learning to form a spectrum handoff strategy, transfer learning is used to increase the convergence speed of reinforcement learning for new users. First, the original secondary user completes reinforcement learning in a radio environment. Then, the original secondary user is treated as an expert user, and the Q table obtained through reinforcement learning is transferred to newly arrived secondary users. Finally, the new users complete their own reinforcement learning based on the Q table. Through simulation experiments comparing the reinforcement learning convergence process of new secondary users with and without transfer learning, it is found that transfer learning can significantly improve the convergence rate of new users.
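
The transfer step described above amounts to initializing the new user's Q table with the expert's table instead of zeros. A toy tabular sketch follows; the one-step "channel selection" task, reward matrix, and hyperparameters are our illustrative assumptions, not the paper's cognitive-radio simulator.

```python
import numpy as np

def train_q(env_rewards, episodes=500, alpha=0.1, gamma=0.9, eps=0.1,
            Q_init=None, seed=0):
    """Tabular Q-learning on a toy one-step channel-selection task: each state
    is a channel condition, each action a candidate channel, and
    env_rewards[s, a] is the (deterministic) reward. Passing the expert user's
    table as Q_init is the transfer step described in the paper."""
    rng = np.random.default_rng(seed)
    n_s, n_a = env_rewards.shape
    Q = np.zeros((n_s, n_a)) if Q_init is None else Q_init.copy()
    for _ in range(episodes):
        s = int(rng.integers(n_s))
        # Epsilon-greedy action selection.
        a = int(rng.integers(n_a)) if rng.random() < eps else int(np.argmax(Q[s]))
        r = env_rewards[s, a]
        # One-step task: no successor state, so the update target is just r.
        Q[s, a] += alpha * (r - Q[s, a])
    return Q

rewards = np.array([[1.0, 0.0],
                    [0.0, 1.0]])
Q_expert = train_q(rewards)                              # original secondary user
Q_new = train_q(rewards, episodes=50, Q_init=Q_expert)   # warm-started new user
```

The warm-started learner begins with a near-correct policy, which is why far fewer episodes suffice; this is the convergence-speed effect the paper reports.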

  • Research Article
  • Cited by 6
  • 10.1016/j.patcog.2022.108872
Alleviating the estimation bias of deep deterministic policy gradient via co-regularization
  • Jun 28, 2022
  • Pattern Recognition
  • Yao Li + 3 more

  • Conference Article
  • Cited by 21
  • 10.1109/itsc.2010.5624977
Arterial traffic control using reinforcement learning agents and information from adjacent intersections in the state and reward structure
  • Sep 1, 2010
  • Juan C Medina + 2 more

An application that uses reinforcement learning (RL) agents for traffic control along an arterial under high traffic volumes is presented. RL agents were trained using Q-learning and a modified version of the state representation that included information on the occupancy of the links from neighboring intersections. The proposed structure also includes a reward that considers potential blockage from downstream intersections (due to saturated conditions), as well as pressure to coordinate the signal response with the future arrival of traffic from upstream intersections. Experiments using microscopic simulation software were conducted for an arterial with 5 intersections under high conflicting volumes, and results were compared with the best settings of coordinated pre-timed phasing. Data showed lower delays and fewer stops with RL agents, as well as a more balanced distribution of the delay among all vehicles in the system. Evidence of coordinated-like behavior was found, as the number of stops to traverse the 5 intersections was on average lower than 1.5, and because the distribution of green times across all intersections was very similar. As traffic approached capacity, however, delays with the pre-timed phasing were lower than with RL agents, but the agents produced lower maximum delay times and lower maximum numbers of stops per vehicle. Future research will analyze variable coefficients in the state and reward structures for the system to better cope with a wide variety of traffic volumes, including transitions from oversaturation to undersaturation and vice versa.
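
Because the state includes occupancy information from adjacent intersections, each combination of discretized local and neighboring features must map to one row of the Q-table. A small sketch of such an encoding; the specific features and bin counts are illustrative assumptions, not the paper's exact state design.

```python
def encode_state(features, bins):
    """Pack a tuple of discretized traffic features (e.g., own-queue level plus
    upstream/downstream link-occupancy levels, as in the state design above)
    into a single Q-table row index using mixed-radix encoding.
    `features[i]` must lie in range(bins[i])."""
    idx = 0
    for f, b in zip(features, bins):
        assert 0 <= f < b
        idx = idx * b + f
    return idx

# Own queue (3 levels), upstream occupancy (3 levels), downstream occupancy (2):
idx = encode_state((1, 2, 0), (3, 3, 2))
# Every feature combination maps to a distinct row index in [0, 18).
all_codes = {encode_state((a, b, c), (3, 3, 2))
             for a in range(3) for b in range(3) for c in range(2)}
```

Mixed-radix packing keeps the mapping bijective, so no two discretized traffic situations share a Q-table row.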

  • Conference Article
  • 10.1109/indin51773.2022.9976094
Curriculum Learning in Peristaltic Sortation Machine
  • Jul 25, 2022
  • Mohammed Sharafath Abdul Hameed + 3 more

This paper presents a novel approach to train a Reinforcement Learning (RL) agent faster for transportation of parcels in a Peristaltic Sortation Machine (PSM) using curriculum learning (CL). The PSM was developed as a means to transport parcels using an actuator and a flexible film, where an RL agent is trained to control the actuator. In a previous paper, training of the actuator was done in a Discrete Element Method (DEM) simulation environment of the PSM, developed using an open-source DEM library called LIGGGHTS, which reduced the training time of the transportation task compared to the real machine. However, it still took days to train the agent. The objective of this paper is to reduce the training time to hours. To overcome this problem, we developed a faster but lower-fidelity Python simulation environment (PSE) capable of simulating the transportation task of the PSM, and used it with a curriculum learning approach to accelerate training of the agent on the transportation task. The RL agent is trained in two steps in the PSE: first with a fixed set of goal positions, then with randomized goal positions. Additionally, we use Gradient Monitoring (GM), a gradient regularization method, which provides additional trust region constraints in the policy updates of the RL agent when switching between tasks. The trained agent is then deployed and tested in the DEM environment, where it has not been trained before. The results obtained show that the RL agent trained using CL and the PSE successfully completes the tasks in the DEM environment without any loss in performance, while using only a fraction of the training time (1.87%) per episode. This will allow for faster prototyping of algorithms to be tested on the PSM in the future.
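
The two-step curriculum described above can be sketched as a goal sampler that switches from a fixed goal set to randomized goals after a set number of episodes. Class and parameter names are illustrative, not taken from the paper's code.

```python
import random

class CurriculumGoalSampler:
    """Minimal sketch of a two-stage goal schedule: stage 1 cycles through a
    fixed set of goal positions, stage 2 samples goals uniformly at random
    from the full range."""

    def __init__(self, fixed_goals, goal_range, stage1_episodes):
        self.fixed_goals = list(fixed_goals)
        self.goal_range = goal_range
        self.stage1_episodes = stage1_episodes
        self.episode = 0

    def next_goal(self, rng):
        self.episode += 1
        if self.episode <= self.stage1_episodes:       # stage 1: fixed goals
            return self.fixed_goals[(self.episode - 1) % len(self.fixed_goals)]
        lo, hi = self.goal_range                       # stage 2: randomized goals
        return lo + (hi - lo) * rng.random()

rng = random.Random(0)
sampler = CurriculumGoalSampler(fixed_goals=[0.25, 0.40],
                                goal_range=(0.0, 1.0), stage1_episodes=4)
goals = [sampler.next_goal(rng) for _ in range(6)]
# First four goals cycle through the fixed set; the last two are random draws.
```

In the paper's setup the agent would additionally carry its stage-1 policy weights into stage 2; the sampler only captures the goal schedule itself.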

  • Research Article
  • Cited by 2
  • 10.1016/j.apenergy.2024.124179
Optimal operational planning of a bio-fuelled cogeneration plant: Integration of sparse nonlinear dynamics identification and deep reinforcement learning
  • Aug 21, 2024
  • Applied Energy
  • Seyed Mohammad Asadzadeh + 1 more

This paper presents a novel data-driven approach for short-term operational planning of a cogeneration plant. The proposed methodology utilizes sparse identification of nonlinear dynamics (SINDy) to extract a dynamic model of heat generation from operational data. This model is then employed to simulate the plant dynamics during the training of a reinforcement learning (RL) agent, enabling online stochastic optimization of the production plan in real time. The incorporation of SINDy enhances the accuracy of capturing the plant's nonlinear dynamics and significantly improves the computational speed of plant simulations, enabling efficient RL agent training within a reasonable timeframe. The performance of operational planning with the RL agent is compared to that of dynamic programming, a widely used method in the literature. The evaluation metrics encompass energy efficiency, unmet demand, and wasted heat. The comparison investigates the effectiveness of RL and dynamic programming under various scenarios with different qualities of energy demand forecasts. The RL agent exhibits robustness and notably improves the operational planning performance, particularly when faced with uncertain energy demands in the environment. Furthermore, the findings show that the RL agent, trained on data from a school building, could successfully perform planning tasks for a hotel building, indicating the transferability of learned planning knowledge across different cogeneration use cases.
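
At the core of SINDy is sequentially thresholded least squares: regress the measured derivatives on a library of candidate terms, zero out small coefficients, and refit on the survivors. A NumPy sketch on a toy first-order decay system; the library, threshold, and dynamics are illustrative, not the heat-generation model from the paper.

```python
import numpy as np

def stlsq(Theta, dX, threshold=0.1, iters=10):
    """Sequentially thresholded least squares, the regression at the core of
    SINDy: fit the derivatives dX on a library Theta of candidate terms, then
    repeatedly zero small coefficients and refit using only the survivors."""
    Xi = np.linalg.lstsq(Theta, dX, rcond=None)[0]
    for _ in range(iters):
        small = np.abs(Xi) < threshold
        Xi[small] = 0.0
        for k in range(Xi.shape[1]):
            big = ~small[:, k]
            if big.any():
                Xi[big, k] = np.linalg.lstsq(Theta[:, big], dX[:, k],
                                             rcond=None)[0]
    return Xi

# Toy dynamics dx/dt = -0.5 x, to be rediscovered from a 3-term library.
x = np.linspace(0.5, 2.0, 50)
dx = -0.5 * x
Theta = np.column_stack([np.ones_like(x), x, x ** 2])   # candidates: 1, x, x^2
Xi = stlsq(Theta, dx[:, None], threshold=0.05)
# Xi is sparse: only the x term survives, with coefficient ~ -0.5.
```

The surviving coefficient recovers the governing term while the spurious library entries are eliminated; this sparsity is what makes the identified model fast enough to serve as the RL training simulator.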

  • Research Article
  • Cited by 2
  • 10.3906/elk-2008-94
Relational-grid-world: a novel relational reasoning environment and an agent model for relational information extraction
  • Mar 30, 2021
  • TURKISH JOURNAL OF ELECTRICAL ENGINEERING & COMPUTER SCIENCES
  • Faruk Küçüksubaşi + 1 more

Reinforcement learning (RL) agents are often designed specifically for a particular problem, and they generally have uninterpretable working processes. Statistical methods-based agent algorithms can be improved in terms of generalizability and interpretability using symbolic artificial intelligence (AI) tools such as logic programming. In this study, we present a model-free RL architecture that is supported with explicit relational representations of the environmental objects. For the first time, we use the PrediNet network architecture in a dynamic decision-making problem rather than image-based tasks, and the multi-head dot-product attention network (MHDPA) as a baseline for performance comparisons. We tested the two networks in two environments, i.e., the baseline box-world environment and our novel environment, relational-grid-world (RGW). With the procedurally generated RGW environment, which is complex in terms of visual perceptions and combinatorial selections, it is easy to measure the relational representation performance of the RL agents. The experiments were carried out using different configurations of the environment so that the presented module and the environment could be compared with the baselines. We reached similar policy optimization performance results with the PrediNet architecture and MHDPA. Additionally, we managed to extract the propositional representation explicitly, which makes the agent's statistical policy logic more interpretable and tractable. This flexibility in the agent's policy provides convenience for designing non-task-specific agent architectures. The main contributions of this study are two-fold: an RL agent that can explicitly perform relational reasoning, and a new environment that measures the relational reasoning capabilities of RL agents.

  • Research Article
  • Cited by 10
  • 10.2215/cjn.0000000000000084
Reinforcement Learning for Clinical Applications.
  • Feb 8, 2023
  • Clinical Journal of the American Society of Nephrology
  • Kia Khezeli + 5 more

Introduction
Reinforcement learning formalizes the concept of learning from interactions.1 Broadly, reinforcement learning focuses on a setting in which an agent (decision maker) sequentially interacts with an environment that is partially unknown to them. At each stage, the agent takes an action and receives a reward. The objective of the agent is to maximize rewards accumulated in the long run. There are many situations in health care where decisions are made sequentially for which reinforcement learning approaches could prove useful for decision making. Throughout this article, we consider treatment prescription as an archetypical example to connect reinforcement learning concepts to a health care setting. In this setting, the care provider, the prescribed treatment, and the patients can be viewed as the agent, the action, and the environment, respectively, as depicted in Figure 1.

Figure 1: Sequential treatment of AKI or CKD complications modeled as a reinforcement learning problem.

Background
In this section, with the objective of making reinforcement learning literature more accessible to a clinical audience, we briefly introduce related fundamental concepts and approaches. We refer the interested reader to Sutton and Barto1 for a comprehensive introduction to reinforcement learning.

Markov Decision Processes
Markov decision processes (MDPs) are a formalism of the sequential decision-making problem that has been central to the theoretical and practical advancements of reinforcement learning. In each stage of an MDP, the agent observes the state of the environment and takes an action, which, in turn, results in a change of the state. This change of state is assumed to be probabilistic, with the next state being determined only by the preceding state, the chosen action, and the transition probability. The agent also receives a reward that is a function of the taken action, the preceding state, and the subsequent state. In an MDP, the objective of the agent is to maximize the return, defined as the reward accumulated over a time horizon. In some applications, it is common to consider the horizon to be infinite, in which case the future rewards are discounted by a factor smaller than one. The selection of action by the agent on the basis of the observed state is known as the policy. More formally, a policy is a probabilistic mapping from states to each possible action. Because the policy and the reward are a function of the state, it is critical to estimate the utility of being in a certain state. More specifically, the value function is defined as the expected return starting from a given state under the chosen policy. Under this formalism, the objective of the agent is to find the optimal policy that maximizes the value function for all states.

Reinforcement Learning Methods
Action-value methods are a class of reinforcement learning methods in which the actions are chosen on the basis of the estimation of their long-term value. A prominent example of an action-value method is Q-learning, in which the agent iteratively takes actions with the highest estimated values and updates the action-state value function on the basis of new observations. Policy gradient methods are another class of reinforcement learning methods that seek to optimize the policy directly instead of choosing actions on the basis of their respective estimated value. Such methods could be advantageous in health care applications that entail a large number of possible actions, e.g., when recommending a wide range of drug dosages or treatment options.

Clinical Applications
Reinforcement learning frameworks and methods are broadly applicable to clinical settings in which decisions are made sequentially. A prominent clinical application of reinforcement learning is for treatment recommendation, which has been studied across a variety of diseases and treatments including radiation and chemotherapy for cancer, brain stimulation for epilepsy, and treatment strategies for sepsis.2–5 In such treatment recommendation settings, a policy is commonly known as a dynamic treatment regime. There are various other clinical applications of reinforcement learning, including diagnosis, medical imaging, and decision support tools (see refs. 2–5 and the references therein).

Reinforcement Learning in Nephrology
Although there have been recent applications of machine learning in nephrology,6,7 to the best of the authors' knowledge, the application of reinforcement learning to nephrology has been primarily limited to optimizing the erythropoietin dosage in hemodialysis patients.8,9 However, there are other settings where reinforcement learning has the potential to improve patient care in nephrology. For example, reinforcement learning methods can be adopted in the treatment of the complications of AKI or CKD (Figure 1). In this problem, the state models the conditions of the patient (e.g., vital signs, laboratory test results including urine and blood tests, and urine output measurements). The action refers to the treatment options (e.g., the dosage of medications such as sodium polystyrene sulfonate, and hemodialysis). The reward models the improvement in patient conditions. Similarly, reinforcement learning can help automate and optimize the dosage of immunosuppressive drugs in kidney transplants.

Challenges and Opportunities
Despite the success of reinforcement learning in several simplified clinical settings, its large-scale application to patient care faces several open challenges. The complexity of human biology complicates modeling clinical decision making as a reinforcement learning problem. The state space in such settings is often enormous, which could make a purely computational approach infeasible. Moreover, modeling all potential objectives a priori as a reward function may not be feasible. To overcome these challenges and realize the potential of reinforcement learning, clinical insight can play a pivotal role. More specifically, restricting the state space to only include highly relevant clinical variables could greatly reduce the computational complexity. Furthermore, using inverse reinforcement learning,2 relevant reward functions can be learned from retrospective studies assuming the optimality of clinical decisions. Another critical challenge is addressing moral and ethical concerns. It is imperative to ensure that reinforcement learning methods do not cause harm to the patient. To this end, there exists a need for a thorough validation of such methods before their use in patient care. Hence, there is a need to go beyond retrospective studies that have been used for the proof of concept of most existing reinforcement learning methods in health care applications.2,3 The lessons learned from the success of reinforcement learning in other application areas (e.g., self-driving cars) can help navigate the path to realizing its potential in health care. Accessible open-source simulation environments that enable researchers to compare various approaches are essential to the field of reinforcement learning. OpenAI Gym is currently the leading toolkit containing a wide range of simulated environments, e.g., surgical robotics.10 The development of high-quality and reliable simulation environments for nephrology and other health care applications can facilitate the development and validation of reinforcement learning methods beyond limited retrospective studies. The adoption of methods validated in such simulation environments in actual clinical settings will require clinicians' oversight. Similar to how self-driving cars require a human driver to ensure collision avoidance, clinicians' oversight is critical to ensure the safety of the patients, especially in the early stages of the adoption of reinforcement learning methods. The data from clinicians' decisions (e.g., overruling the automated treatment recommendation) can be used to improve the reliability of autonomous systems over time and reduce the burden of clinicians' oversight.
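
The action-value method this article describes, Q-learning, reduces to a one-line update rule. A minimal tabular sketch, with states and actions as abstract indices rather than any validated clinical model:

```python
import numpy as np

def q_learning_step(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """One tabular Q-learning update: move Q(s, a) toward the observed reward
    plus the discounted value of the best action in the next state."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# Toy 2-state, 2-action example (abstract indices, purely illustrative):
# acting from state 0 with action 1 yields reward 1 and lands in state 1.
Q = np.zeros((2, 2))
Q = q_learning_step(Q, s=0, a=1, r=1.0, s_next=1)
```

Repeating this update over observed (state, action, reward, next state) transitions is what "updating the action-state value function on the basis of new observations" means in the text above.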

  • Dissertation
  • Cited by 3
  • 10.1184/r1/8397962.v1
Towards Generalization and Efficiency in Reinforcement Learning
  • Jul 2, 2019
  • Wen Sun

Different from classic supervised learning, reinforcement learning (RL) is fundamentally interactive: an autonomous agent must learn how to behave in an unknown, uncertain, and possibly hostile environment, by actively interacting with the environment to collect useful feedback to improve its sequential decision-making ability. The RL agent will also intervene in the environment: the agent makes decisions which in turn affect the further evolution of the environment. Because of its generality (most machine learning problems can be viewed as special cases), RL is hard. As there is no direct supervision, one central challenge in RL is how to explore an unknown environment and collect useful feedback efficiently. In recent RL success stories (e.g., super-human performance on video games [Mnih et al., 2015]), we notice that most of them rely on random exploration strategies, such as ε-greedy. Similarly, policy gradient methods such as REINFORCE [Williams, 1992] perform exploration by injecting randomness into the action space, hoping the randomness leads to a good sequence of actions that achieves high total reward. The theoretical RL literature has developed more sophisticated algorithms for efficient exploration (e.g., [Azar et al., 2017]); however, the sample complexity of these near-optimal algorithms has to scale exponentially with respect to key parameters of the underlying system, such as the dimensions of the state and action spaces. Such exponential dependence prohibits a direct application of these theoretically elegant RL algorithms to large-scale applications. In summary, without any further assumptions, RL is hard, both in practice and in theory. In this thesis, we attempt to gain purchase on the RL problem by introducing additional assumptions and sources of information.
The first contribution of this thesis comes from improving RL sample complexity via imitation learning. By leveraging an expert's demonstrations, imitation learning significantly simplifies the task of exploration. We consider two settings in this thesis: the interactive imitation learning setting, where an expert is available to query during training time, and the setting of imitation learning from observation alone, where we only have a set of demonstrations that consist of observations of the expert's states (no expert actions are recorded). We study, both in theory and in practice, how one can imitate experts to reduce sample complexity compared to a pure RL approach.
The second contribution comes from model-free reinforcement learning. Specifically, we study policy evaluation by building a general reduction from policy evaluation to no-regret online learning, an active research area with a well-established theoretical foundation. Such a reduction creates a new family of algorithms for provably correct policy evaluation under very weak assumptions on the generating process. We then provide a thorough theoretical and empirical study of two model-free exploration strategies: exploration in action space and exploration in parameter space.
The third contribution of this work comes from model-based reinforcement learning. We provide the first exponential sample complexity separation between model-based RL and general model-free RL approaches. We then provide a PAC model-based RL algorithm that can achieve sample efficiency simultaneously for many interesting MDPs, such as tabular MDPs, factored MDPs, Lipschitz continuous MDPs, low-rank MDPs, and Linear Quadratic Control. We also provide a more practical model-based RL framework, called Dual Policy Iteration (DPI), by integrating optimal control, model learning, and imitation learning together. Furthermore, we show a general convergence analysis that extends the existing approximate policy iteration theories to DPI. DPI generalizes and provides the first theoretical foundation for recent successful practical RL algorithms such as ExIt and AlphaGo Zero [Anthony et al., 2017, Silver et al., 2017], and provides a theoretically sound and practically efficient way of unifying model-based and model-free RL approaches.
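
The ε-greedy rule the thesis cites as the typical random exploration strategy is simple to state: with probability ε the agent acts uniformly at random, otherwise greedily with respect to its current value estimates. A minimal sketch (the value estimates and ε are illustrative):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Random exploration as described above: with probability epsilon pick a
    uniformly random action, otherwise the greedy (highest-value) one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

rng = np.random.default_rng(0)
q = np.array([0.1, 0.5, 0.2])
actions = [epsilon_greedy(q, epsilon=0.1, rng=rng) for _ in range(1000)]
# Mostly action 1 (the greedy choice), with occasional random exploration.
```

The thesis's point is that this kind of undirected randomness is cheap but can be exponentially inefficient at discovering good action sequences, which motivates the imitation-learning and model-based alternatives it develops.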

  • Research Article
  • Cited by 24
  • 10.1016/j.tics.2020.09.002
Artificial Intelligence and the Common Sense of Animals.
  • Oct 8, 2020
  • Trends in Cognitive Sciences
  • Murray Shanahan + 3 more

  • Research Article
  • Cited by 77
  • 10.1037/abn0000259
Explicit and implicit reinforcement learning across the psychosis spectrum.
  • Jul 1, 2017
  • Journal of Abnormal Psychology
  • Deanna M Barch + 9 more

Motivational and hedonic impairments are core features of a variety of types of psychopathology. An important aspect of motivational function is reinforcement learning (RL), including implicit (i.e., outside of conscious awareness) and explicit (i.e., including explicit representations about potential reward associations) learning, as well as both positive reinforcement (learning about actions that lead to reward) and punishment (learning to avoid actions that lead to loss). Here we present data from paradigms designed to assess both positive and negative components of both implicit and explicit RL, examine performance on each of these tasks among individuals with schizophrenia, schizoaffective disorder, and bipolar disorder with psychosis, and examine their relative relationships to specific symptom domains transdiagnostically. None of the diagnostic groups differed significantly from controls on the implicit RL tasks in either bias toward a rewarded response or bias away from a punished response. However, on the explicit RL task, both the individuals with schizophrenia and schizoaffective disorder performed significantly worse than controls, but the individuals with bipolar disorder did not. Worse performance on the explicit RL task, but not the implicit RL task, was related to worse motivation and pleasure symptoms across all diagnostic categories. Performance on explicit RL, but not implicit RL, was related to working memory, which accounted for some of the diagnostic group differences. However, working memory did not account for the relationship of explicit RL to motivation and pleasure symptoms. These findings suggest transdiagnostic relationships across the spectrum of psychotic disorders between motivation and pleasure impairments and explicit RL.

  • Research Article
  • Cited by 24
  • 10.1093/ptj/pzz043
Robot Reinforcement and Error-Based Movement Learning in Infants With and Without Cerebral Palsy.
  • Jun 1, 2019
  • Physical Therapy
  • Thubi H A Kolobe + 1 more

Prone mobility, central to development of diverse psychological and social processes that have lasting effects on life participation, is seldom attained by infants with cerebral palsy (CP) and has no tested interventions. Reinforcement learning (RL) and error-based movement learning (EBL) offer novel intervention possibilities. This study examined movement learning strategies in infants with or at risk for CP using RL and EBL during acquisition of prone locomotion. The study was a randomized trial that used repeated measures. The study setting was a university physical therapy clinic in the United States. Thirty infants aged 4.5 to 6.5 months participated in the study: 24 had or were at risk for CP, and 6 were typically developing. Infants with and at risk for CP were randomly assigned to a combination of RL and EBL (SIPPC-RE) or RL-only (SIPPC-R) conditions. Infants with typical development comprised the RL-only reference group (SIPPC-TD). Infants trained in prone locomotion with the Self-Initiated Prone Progression Crawler (SIPPC) robotic system for three 5-minute trials, twice a week for 12 weeks, in their homes or child care. All training sessions were videotaped for behavioral coding. The SIPPC gathered robot and infant trunk/limb movement data. Randomized 2-way analysis of variance with repeated measures and Pearson r were used to analyze the data. The number of arm movements and the amount of trial-and-error activity distinguished between the SIPPC-RE and SIPPC-R groups. The mean change in arm movements from baseline for the SIPPC-RE and SIPPC-R groups was 4.8 m and -7.0 m, respectively. The mean differences in rotational amplitude (trial and error) from baseline to the end of the study were 278 degrees and 501 degrees, respectively. These changes were correlated with distance traveled and goal-directed movements. The latter increased over the 12 weeks for the SIPPC-RE and SIPPC-TD groups, but not the SIPPC-R group.
The CP groups were unequal in size due to reassignment, and the study did not include a typically developing comparison group for the combination of RL and EBL. These findings suggest that movement learning and retention in infants with CP are differentially affected by the use of RL and EBL, with a combination of both showing more promise than RL alone. The findings also implicate cognition, type of brain insult, emergence of reaching, and muscle force production, which must be explored in future studies.

  • Research Article
  • 10.1016/j.pnpbp.2024.111173
Disentangling negative reinforcement, working memory, and deductive reasoning deficits in elevated BMI
  • Oct 12, 2024
  • Progress in Neuropsychopharmacology & Biological Psychiatry
  • Gibson Weydmann + 6 more

  • Research Article
  • Cited by 1
  • 10.2139/ssrn.3424668
Variance Reduction in Actor Critic Methods (ACM)
  • Jan 1, 2019
  • SSRN Electronic Journal
  • Eric Benhamou

After presenting Actor Critic Methods (ACM), we show that ACM are control variate estimators. Using the projection theorem, we prove that the Q and Advantage Actor Critic (A2C) methods are optimal in the sense of the $L^2$ norm for the control variate estimators spanned by functions conditioned on the current state and action. This straightforward application of Pythagoras' theorem provides a theoretical justification for the strong performance of QAC and AAC, most often referred to as A2C methods, in deep policy gradient methods. This enables us to derive a new formulation for Advantage Actor Critic methods that has lower variance and improves on the traditional A2C method.
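
The control-variate view can be illustrated with a toy two-action policy: subtracting any constant baseline from the action value leaves the score-function gradient estimate unbiased but can sharply reduce its variance. The example below is our own illustration, not the paper's projection-theorem construction; it enumerates the discrete action distribution exactly, with no sampling.

```python
import numpy as np

p = np.array([0.3, 0.7])   # softmax policy probabilities over two actions
Q = np.array([1.0, 2.0])   # action values, assumed known for the toy case

def grad_samples(baseline):
    """Per-action score-function gradient estimates for the first logit:
    (1{a=0} - p[0]) * (Q[a] - baseline). Any constant baseline acts as a
    control variate: it changes the per-sample values but not the mean."""
    return np.array([((a == 0) - p[0]) * (Q[a] - baseline) for a in (0, 1)])

def mean_var(baseline):
    g = grad_samples(baseline)
    mean = np.sum(p * g)                  # exact expectation over a ~ p
    var = np.sum(p * (g - mean) ** 2)     # exact variance over a ~ p
    return mean, var

m0, v0 = mean_var(baseline=0.0)               # plain REINFORCE-style estimator
mV, vV = mean_var(baseline=np.sum(p * Q))     # value-function baseline (advantage)
```

Using the value function as the baseline turns Q into the advantage, which is the A2C-style estimator whose optimality among such control variates the paper establishes; here the two estimators have the same mean while the baselined one has roughly a tenth of the variance.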

  • Conference Article
  • 10.69997/sct.132857
Towards Self-Tuning PID Controllers: A Data-Driven, Reinforcement Learning Approach for Industrial Automation
  • Jul 1, 2025
  • Kyle Territo + 2 more

As industries embrace the digitalization of Industry 4.0, the abundance of process data creates new opportunities to optimize industrial control systems. Traditional Proportional-Integral-Derivative (PID) controllers often require manual tuning to address changing conditions. This paper introduces an automated, adaptive PID tuning method using historical data and machine learning for a continuously evolving, data-driven approach. The method centers on training a surrogate model using historical process data to replicate real system behavior under various conditions. This enables safe exploration of control strategies without disrupting live operations. An RL (Reinforcement Learning) agent interacts with the surrogate model to learn optimal control policies, dynamically responding to the plant's state, defined by variables like operational conditions and measured disturbances. The agent adjusts PID parameters in real-time, optimizing metrics such as stability, response time, and energy efficiency. After training, the RL agent is deployed online to monitor and adjust PID controllers in response to real-time deviations. The system continuously integrates new data to refine the surrogate model and RL agent, ensuring adaptability to long-term process changes. This continuous learning enhances resilience and scalability, maintaining optimal performance in dynamic environments. By combining data-driven modeling with RL, this method automates PID tuning, maximizing process data utility and aligning with Industry 4.0 principles. It reduces manual oversight while improving efficiency, reliability, and sustainability, addressing the challenges of increasingly complex and data-rich industrial systems.
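
The interaction described above, an RL agent adjusting PID gains on top of a conventional control loop, can be sketched with a textbook discrete PID controller and a hypothetical gain-scaling action. The plant, gains, and action format are illustrative assumptions, not details from the paper.

```python
class PID:
    """Textbook discrete PID controller; the RL agent described above would
    periodically overwrite kp/ki/kd based on the observed plant state."""

    def __init__(self, kp, ki, kd, dt=1.0):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, setpoint, measurement):
        error = setpoint - measurement
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

def apply_agent_action(pid, action):
    """Hypothetical RL action: multiplicative tweaks to the three gains."""
    pid.kp *= action[0]
    pid.ki *= action[1]
    pid.kd *= action[2]

# First-order plant x' = -x + u, simulated with Euler steps toward setpoint 1.
pid = PID(kp=2.0, ki=0.5, kd=0.0, dt=0.1)
x = 0.0
for _ in range(200):
    u = pid.update(setpoint=1.0, measurement=x)
    x += 0.1 * (-x + u)   # the loop settles near the setpoint

# One hypothetical agent action scaling the proportional gain by 1.2:
pid2 = PID(kp=2.0, ki=0.5, kd=0.0)
apply_agent_action(pid2, (1.2, 1.0, 1.0))
```

In the full method the action would come from the trained policy and the plant would be the surrogate model learned from historical data; here a first-order plant is simulated directly just to show the loop settling.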
