Corrigendum to “Safe fixed-time reinforcement learning for nonlinear zero-sum games with obstacle avoidance awareness” [Automatica 183 (2025) 112673]


Similar Papers
  • Research Article
  • Cited by: 11
  • 10.1016/j.jhydrol.2023.129435
Flooding mitigation through safe & trustworthy reinforcement learning
  • Mar 24, 2023
  • Journal of Hydrology
  • Wenchong Tian + 5 more


  • Conference Article
  • 10.1145/3576841.3585936
Self-Preserving Genetic Algorithms for Safe Learning in Discrete Action Spaces
  • May 9, 2023
  • Preston K Robinette + 2 more

Self-Preserving Genetic Algorithms (SPGA) combine the evolutionary strategy of a genetic algorithm with safety assurance methods commonly implemented in safe reinforcement learning (SRL), a branch of reinforcement learning (RL) that accounts for safety in the exploration and decision-making process of the agent. Safe learning approaches are especially important in safety-critical environments, where failure to account for the safety of the controlled system could result in the loss of millions of dollars in hardware or bodily harm to people working nearby, as is true of many cyber-physical systems. While SRL is a viable approach to safe learning, there are many challenges that must be taken into consideration when training agents, such as sample efficiency, stability, and exploration; the last is easily addressed by the evolutionary strategy of a genetic algorithm. By combining GAs with the safety mechanisms used with SRL, SPGA offers a safe learning alternative that is able to explore large areas of the solution space, addressing SRL's challenge of exploration. This work implements SPGA with both action masking and run time assurance safety strategies to evolve safe controllers for three types of discrete action space environments applicable to cyber-physical systems (control, routing, and operations) and under various safety conditions. Training and testing evaluation metrics are compared with results from SRL-trained controllers to validate results. SPGA and SRL controllers are trained across 5 random seeds and evaluated on 500 episodes to calculate average wall time to train, average expected return, and percentage of safe action evaluation metrics. SPGA achieves comparable reward and safety performance results with significantly improved training efficiency (55x faster on average), demonstrating the effectiveness of this safe learning approach.
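Action masking, one of the two safety strategies the abstract names, can be sketched in a few lines. The 1-D track and the `on_track` safety rule below are illustrative assumptions, not taken from the paper:

```python
import random

def safe_action_mask(state, n_actions, is_safe):
    """Boolean mask over the discrete action space: True where the
    action is deemed safe to take from `state`."""
    return [is_safe(state, a) for a in range(n_actions)]

def masked_choice(state, n_actions, is_safe, rng=random):
    """Sample uniformly among safe actions; fall back to the full
    action set if the mask rules out everything."""
    mask = safe_action_mask(state, n_actions, is_safe)
    allowed = [a for a, ok in enumerate(mask) if ok]
    return rng.choice(allowed or list(range(n_actions)))

# Toy safety rule: on a 1-D track with cells 0..9, a move
# (0 = left, 1 = right) is safe only if it stays on the track.
def on_track(state, action):
    nxt = state + (1 if action == 1 else -1)
    return 0 <= nxt <= 9

print(safe_action_mask(0, 2, on_track))  # at the left edge only "right" is safe
```

The same mask can gate either an RL policy's action distribution or, as in SPGA, the actions expressed by evolved individuals.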

  • Conference Article
  • 10.1145/3576841.3589635
DEMO: Self-Preserving Genetic Algorithms vs. Safe Reinforcement Learning in Discrete Action Spaces
  • May 9, 2023
  • Preston K Robinette + 2 more

Safe learning techniques are learning frameworks that take safety into consideration during the training process. Safe reinforcement learning (SRL) combines reinforcement learning (RL) with safety mechanisms such as action masking and run time assurance to protect an agent during the exploration of its environment. This protection, though, can severely hinder an agent's ability to learn optimal policies as the safety systems exacerbate an already difficult exploration challenge for RL agents. An alternative to RL is an optimization approach known as genetic algorithms (GA), which utilize operators that mimic biological evolution to evolve better policies. By combining safety mechanisms with genetic algorithms, this work demonstrates a novel approach to safe learning called Self-Preserving Genetic Algorithms.

  • Conference Article
  • Cited by: 13
  • 10.23919/acc53348.2022.9867652
Computationally Efficient Safe Reinforcement Learning for Power Systems
  • Jun 8, 2022
  • Daniel Tabas + 1 more

We propose a computationally efficient approach to safe reinforcement learning (RL) for frequency regulation in power systems with high levels of variable renewable energy resources. The approach draws on set-theoretic control techniques to craft a neural network-based control policy that is guaranteed to satisfy safety-critical state constraints, without needing to solve a model predictive control or projection problem in real time. By exploiting the properties of robust controlled-invariant polytopes, we construct a novel, closed-form "safety-filter" that enables end-to-end safe learning using any policy gradient-based RL algorithm. We then apply the safety filter in conjunction with the deep deterministic policy gradient (DDPG) algorithm to regulate frequency in a modified 9-bus power system, and show that the learned policy is more cost-effective than robust linear feedback control techniques while maintaining the same safety guarantee. We also show that the proposed paradigm outperforms DDPG augmented with constraint violation penalties.
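The paper's filter is built in closed form from robust controlled-invariant polytopes; as a loose stand-in, the sketch below scales a proposed input toward zero until the one-step successor stays inside an axis-aligned box. The dynamics, box, and scaling scheme are invented for illustration and are much cruder than the paper's construction:

```python
import numpy as np

def box_safety_filter(x, u_raw, A, B, lo, hi):
    """Scale the proposed input u_raw until the one-step successor
    A@x + B@u stays inside the box [lo, hi] (a toy stand-in for a
    robust controlled-invariant polytope). Assumes u = 0 keeps the
    state inside the box, here guaranteed by a stable A."""
    for s in np.linspace(1.0, 0.0, 101):
        u = s * u_raw
        x_next = A @ x + B @ u
        if np.all(x_next >= lo) and np.all(x_next <= hi):
            return u
    return 0.0 * u_raw

A = 0.9 * np.eye(2)                    # stable open-loop dynamics
B = np.eye(2)
lo, hi = -np.ones(2), np.ones(2)       # "safe set" box
x = np.array([0.9, 0.0])               # state near the boundary
u = box_safety_filter(x, np.array([1.0, 0.0]), A, B, lo, hi)
```

Because the filter is an explicit loop over a scalar, no optimization problem is solved online, which mirrors the paper's motivation for avoiding real-time MPC or projection steps.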

  • Book Chapter
  • Cited by: 13
  • 10.1007/978-3-030-73959-1_12
Safe Learning and Optimization Techniques: Towards a Survey of the State of the Art
  • Jan 1, 2021
  • Youngmin Kim + 2 more

Safe learning and optimization deals with learning and optimization problems that avoid, as much as possible, the evaluation of non-safe input points, which are solutions, policies, or strategies that cause an irrecoverable loss (e.g., breakage of a machine or equipment, or a threat to life). Although a comprehensive survey of safe reinforcement learning algorithms was published in 2015, a number of new algorithms have been proposed thereafter, and related works in active learning and in optimization were not considered. This paper reviews those algorithms from a number of domains including reinforcement learning, Gaussian process regression and classification, evolutionary algorithms, and active learning. We provide the fundamental concepts on which the reviewed algorithms are based and a characterization of the individual algorithms. We conclude by explaining how the algorithms are connected and by offering suggestions for future research.

  • Dissertation
  • 10.26083/tuprints-00017536
Off-Policy Reinforcement Learning for Robotics
  • Mar 30, 2021
  • Samuele Tosatto

Nowadays, industrial processes are vastly automated by means of robotic manipulators. In some cases, robots occupy a large fraction of the production line, performing a rich range of tasks. In contrast to their tireless ability to repeatedly perform the same tasks with millimetric precision, current robotics exhibits low adaptability to new scenarios. This lack of adaptability in many cases hinders a closer human-robot interaction; furthermore, when one needs to apply some change to the production line, the robots need to be reconfigured by highly qualified figures. Machine learning and, more particularly, reinforcement learning hold the promise to provide automated systems that can adapt to new situations and learn new tasks. Despite the overwhelming progress in recent years in the field, the vast majority of reinforcement learning is not directly applicable to real robotics. State-of-the-art reinforcement learning algorithms require intensive interaction with the environment and are unsafe in the early stage of learning, when the policy performs poorly and potentially harms the system. For these reasons, the application of reinforcement learning has been successful mainly on simulated tasks such as computer and board games, where it is possible to collect a vast amount of samples in parallel and there is no possibility of damaging any real system. To mitigate these issues, researchers proposed first to employ imitation learning to obtain a reasonable policy, and subsequently to refine it via reinforcement learning. In this thesis, we focus on two main issues that prevent the mentioned pipeline from working efficiently: (i) robotic movements are represented with a high number of parameters, which prevents both safe and efficient exploration; (ii) the policy improvement is usually on-policy, which also causes inefficient and unsafe updates.
In Chapter 3 we propose an efficient method to perform dimensionality reduction of learned robotic movements, exploiting redundancies in the movement spaces (which occur more commonly in manipulation tasks) rather than redundancies in the robot kinematics. The dimensionality reduction allows the projection to latent spaces, representing with high probability movements close to the demonstrated ones. To make reinforcement learning safer and more efficient, we define the off-policy update in the movement’s latent space in Chapter 4. In Chapter 5, we propose a novel off-policy gradient estimation, which makes use of a particular non-parametric technique named Nadaraya-Watson kernel regression. Building on a solid theoretical framework, we derive statistical guarantees. We believe that providing strong guarantees is at the core of safe machine learning. In this spirit, we further expand and analyze the statistical guarantees on Nadaraya-Watson kernel regression in Chapter 6. Usually, to avoid challenging exploration in reinforcement learning applied to robotics, one must define highly engineered reward functions. This limitation hinders the possibility of allowing non-expert users to define new tasks. Exploration remains an open issue in high-dimensional and sparse-reward settings. To mitigate this issue, we propose a far-sighted exploration bonus built on information-theoretic principles in Chapter 7. To test our algorithms, we provide a full analysis both on simulated environments and, in some cases, on real-world robotic tasks. The analysis supports our claims, showing that our proposed techniques can safely learn in the presence of a limited set of demonstrations and robotic interactions.
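The Nadaraya-Watson estimator the thesis builds on is a short formula: a kernel-weighted average of observed targets. A minimal sketch on toy data (the Gaussian kernel, bandwidth, and samples are illustrative, not from the thesis):

```python
import math

def nadaraya_watson(x, xs, ys, h):
    """Nadaraya-Watson estimate at query point x: a Gaussian-kernel
    weighted average of targets ys observed at inputs xs, with
    bandwidth h."""
    weights = [math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)

xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 1.0, 4.0, 9.0]   # noiseless samples of y = x^2
est = nadaraya_watson(1.0, xs, ys, h=0.1)
```

With a small bandwidth the estimate at a sample point reduces to the observed value there; larger bandwidths trade variance for bias by averaging over neighbors.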

  • Research Article
  • Cited by: 14
  • 10.1109/lra.2020.2990743
Data-Efficient and Safe Learning for Humanoid Locomotion Aided by a Dynamic Balancing Model
  • Jul 1, 2020
  • IEEE Robotics and Automation Letters
  • Junhyeok Ahn + 2 more

In this letter, we formulate a novel Markov Decision Process (MDP) for safe and data-efficient learning for humanoid locomotion aided by a dynamic balancing model. In our previous studies of biped locomotion, we relied on a low-dimensional robot model, commonly used in high-level Walking Pattern Generators (WPGs). However, a low-level feedback controller cannot precisely track desired footstep locations due to the discrepancies between the full order model and the simplified model. In this study, we propose mitigating this problem by complementing a WPG with reinforcement learning. More specifically, we propose a structured footstep control method consisting of a WPG, a neural network, and a safety controller. The WPG provides an analytical method that promotes efficient learning, the neural network maximizes long-term rewards, and the safety controller encourages safe exploration based on step capturability and the use of control-barrier functions. Our contributions include the following: (1) a structured learning control method for locomotion, (2) a data-efficient and safe learning process to improve walking using a physics-based model, and (3) the scalability of the procedure to various types of humanoid robots and walking.

  • Research Article
  • Cited by: 9
  • 10.1177/09596518231153445
Safe deep reinforcement learning in diesel engine emission control.
  • Feb 17, 2023
  • Proceedings of the Institution of Mechanical Engineers, Part I: Journal of Systems and Control Engineering
  • Armin Norouzi + 4 more

A deep reinforcement learning application is investigated to control the emissions of a compression ignition diesel engine. The main purpose of this study is to reduce the engine-out nitrogen oxide emissions and to minimize fuel consumption while tracking a reference engine load. First, a physics-based engine simulation model is developed in GT-Power and calibrated using experimental data. Using this model and a GT-Power/Simulink co-simulation, a deep deterministic policy gradient is developed. To reduce the risk of an unwanted output, a safety filter is added to the deep reinforcement learning. Based on the simulation results, this filter has no effect on the final trained deep reinforcement learning policy; however, during the training process, it is crucial for enforcing constraints on the controller output. The developed safe reinforcement learning is then compared with an iterative learning controller and a deep neural network-based nonlinear model predictive controller. This comparison shows that the safe reinforcement learning is capable of accurately tracking an arbitrary reference input while the iterative learning controller is limited to a repetitive reference. The comparison between the nonlinear model predictive control and reinforcement learning indicates that, for this case, reinforcement learning is able to learn the optimal control output directly from the experiment without the need for a model. However, to enforce output constraints for safe reinforcement learning, a simple model of the system is required. In this work, reinforcement learning was able to reduce emissions more than the nonlinear model predictive control; however, it suffered from slightly higher error in load tracking and a higher fuel consumption.
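An output-constraint filter of the kind the abstract describes can be sketched with an assumed one-step linear model standing in for the paper's "simple model of the system"; the coefficients `a`, `b` and the limit `y_max` are illustrative, not engine data:

```python
def safety_filter(u_raw, x, a=0.8, b=0.5, y_max=1.0):
    """One-step output-constraint filter. With the assumed linear
    model y_next = a*x + b*u (a stand-in for the paper's simple
    system model), return the input closest to u_raw from below
    that keeps the constrained output y_next <= y_max."""
    u_cap = (y_max - a * x) / b   # largest admissible input at state x
    return min(u_raw, u_cap)
```

When the proposed input is already admissible the filter passes it through unchanged, echoing the abstract's observation that the filter binds during training but has no effect on the final trained policy.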

  • Conference Article
  • Cited by: 2
  • 10.1109/acpee56931.2023.10135995
Research and Application of Safe Reinforcement Learning in Power System
  • Apr 1, 2023
  • Jian Li + 3 more

Agent exploration is a necessary way for reinforcement learning algorithms to obtain information. In order to obtain more exploratory information, some deep reinforcement learning algorithms even increase the exploration of agents. Reinforcement learning has been successfully applied in many intelligent control fields; however, unlimited exploration may bring disastrous consequences to agents, and many concerns still need attention in real-world applications, one of which is safety. Safe reinforcement learning approximately enforces the constraint conditions in each policy update, thus further improving the security and robustness of the intelligent algorithm. Furthermore, owing to the particular nature of electric energy production, transmission, and consumption, power system operation needs to meet requirements of safety, stability, and efficiency. This paper summarizes the theory and characteristics of safe reinforcement learning, and then discusses its application in power systems. Finally, we offer an outlook on the challenging problems of safe reinforcement learning in the power field.

  • Conference Article
  • Cited by: 20
  • 10.15607/rss.2012.viii.011
Reducing Conservativeness in Safety Guarantees by Learning Disturbances Online: Iterated Guaranteed Safe Online Learning
  • Jul 9, 2012
  • Jeremy Gillula + 1 more

Reinforcement learning has proven itself to be a powerful technique in robotics; however, it has not often been employed to learn a controller in a hardware-in-the-loop environment due to the fact that spurious training data could cause a robot to take an unsafe (and potentially catastrophic) action. One approach to overcoming this limitation is known as Guaranteed Safe Online Learning via Reachability (GSOLR), in which the controller being learned is wrapped inside another controller based on reachability analysis that seeks to guarantee safety against worst-case disturbances. This paper proposes a novel improvement to GSOLR which we call Iterated Guaranteed Safe Online Learning via Reachability (IGSOLR), in which the worst-case disturbances are modeled in a state-dependent manner (either parametrically or nonparametrically), this model is learned online, and the safe sets are periodically recomputed (in parallel with whatever machine learning is being run online to learn how to control the system). As a result the safety of the system automatically becomes neither too liberal nor too conservative, depending only on the actual system behavior. This allows the machine learning algorithm running in parallel the widest possible latitude in performing its task while still guaranteeing system safety. In addition to explaining IGSOLR, we show how it was used in a real-world example, namely that of safely learning an altitude controller for a quadrotor helicopter. The resulting controller, which was learned via hardware-in-the-loop reinforcement learning, outperforms our original hand-tuned controller while still maintaining safety. To our knowledge, this is the first example in the robotics literature of an algorithm in which worst-case disturbances are learned online in order to guarantee system safety.
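The IGSOLR iteration (learn a disturbance bound from data, then recompute the safe set) can be caricatured in one dimension. The braking model, flight envelope, and margin formula below are invented for illustration and bear no relation to the paper's actual reachability computation:

```python
def safe_interval(d_max, v_brake=1.0, z_lo=0.0, z_hi=10.0):
    """Toy 1-D stand-in for a reachability computation: with braking
    authority v_brake and worst-case disturbance d_max, keep a margin
    of d_max / v_brake inside each boundary of [z_lo, z_hi]."""
    margin = d_max / v_brake
    return z_lo + margin, z_hi - margin

def igsolr_iteration(observed_disturbances, prior_d_max):
    """One IGSOLR-style iteration, reduced to 1-D: replace the a
    priori worst-case disturbance bound with the bound learned from
    data seen so far, then recompute the safe set."""
    d_max = max(observed_disturbances) if observed_disturbances else prior_d_max
    return d_max, safe_interval(d_max)

# A conservative prior bound of 2.0 shrinks the safe set to (2.0, 8.0);
# learning that disturbances never exceeded 0.5 widens it to (0.5, 9.5).
d_learned, (lo, hi) = igsolr_iteration([0.2, 0.5, 0.3], prior_d_max=2.0)
```

This captures only the headline idea: the online disturbance model lets the recomputed safe set be exactly as conservative as the observed behavior warrants.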

  • Research Article
  • Cited by: 1
  • 10.1002/rnc.5442
Optimal control and learning for cyber‐physical systems
  • Feb 18, 2021
  • International Journal of Robust and Nonlinear Control
  • Yan Wan + 3 more

Modern systems are becoming increasingly complex in their functionality, structure, and dynamics. A successful management of these systems requires enhanced performance in terms of robustness, safety, resiliency, scalability, and usability. To achieve these performance requirements, it is important to adopt cyber-physical system (CPS) design techniques. A CPS features tightly coupled physical and cyber components. The physical components include system dynamics, sensors, controllers, and the uncertain environment in which the system operates. The cyber components include data, communication, control, and computation. The CPS co-design principle suggests that these physical and cyber components should be designed as an integrated whole. CPS studies cross boundaries of multiple science and engineering disciplines and require deep domain knowledge. CPS applications span intelligent transportation, unmanned systems, smart grids, smart homes, smart health care, smart materials, and intelligent civil infrastructures. Developing practical closed-loop optimal decisions is a common and pivotal task for these CPS applications. The optimal control theory finds its significant value in developing such solutions. However, the traditional optimal control theory cannot be directly used because it was developed for systems that do not have the complexity level of modern systems we observe today. Significant challenges exist in developing practical optimal control solutions for real CPSs, considering the increased level of complexity and challenging performance requirements aforementioned. Addressing these challenges requires a seamless integration of the optimal control theory with advances from learning and other science and engineering domains. The performance of such integration or co-design is not fully understood or developed. This special issue focuses on the optimal control theory and learning for CPSs. 
The papers received span broad topics including learning and data-driven optimal control to address physical unknowns and disturbances, estimation techniques to deal with uncertainties; secure and resilient solutions that account for disturbances, faults, and attacks; control solutions subject to physical constraints on energy, actuation, communication, and computation; and CPS applications toward robotics and power grids. The 18 accepted papers are categorized into five directions and summarized as follows: In the paper by Fu et al. titled “Resilient consensus-based distributed optimization under deception attacks,” the authors investigate the problem of distributed optimization subject to L-local deception attacks. In such attacks, the attackers can modify the information transmitted through at most L communication links to each node. The authors design a resilient consensus-based distributed optimization algorithm, where the nodes cooperatively estimate the optimizers according to their subgradients and their partial neighbors' estimates. Conditions are developed for all nodes to agree on their estimations and reach a resilient optimal solution. The paper by Zhai and Vamvoudakis studies replay attacks to linear quadratic Gaussian (LQG) zero-sum games. The authors develop a data-based and private learning framework to detect and mitigate replay attacks. A Neyman–Pearson detector is used to detect replay attacks. Optimal watermarking signals are added to help the detection and achieve a trade-off between the detection performance and control performance loss. A data-based technique is developed to learn the best defending strategy in the presence of worst-case disturbances, stochastic noise, and replay attacks. Denial-of-service (DoS) attacks that block information exchange are also common in CPS. In the paper titled “Event-based resilience to DoS attacks on communication for consensus of networked Lagrangian systems,” Li et al. 
study the resilience of event-based consensus control of networked Lagrangian systems under DoS attacks. An event-based controller is designed in the absence of DoS attacks, and the resilience for the controller is analyzed under DoS attacks. Some conditions associated with the DoS duration and frequency are identified for the resilience of the control system. Periodic energy-limited DoS attacks are considered in the paper titled “Event-triggered output synchronization for nonhomogeneous agent systems with periodic denial-of-service attacks.” The authors Xu et al. develop a two-layer control framework for each agent of nonhomogeneous linear dynamics. The first layer is a dynamic compensator to track the dynamics of the leader node using an event-triggered protocol, and the second layer is an output regulator to synchronize to the compensator dynamics. The paper authored by Zhang et al. studies the sliding mode control of a class of interval type-2 fuzzy systems subject to intermittent DoS attacks. A switched type-2 fuzzy estimator is designed that serves as a state observer to estimate immeasurable states when DoS attacks are absent and serves as a compensator to generate measurement signals for the control when DoS attacks are in place. A switched sliding mode control is then developed in both attack-free and attack-active cases. The acceptable DoS region is analyzed using the switched Lyapunov analysis. The dynamics of physical components of CPS may not be completely known. Reinforcement learning is data-driven adaptive optimal control that does not require full knowledge of the physical dynamics. The article authored by Guo et al. studies state feedback and output feedback Q-learning of a two-wheeled self-balancing robot (TWSBR). The solution realizes linear quadratic regulation (LQR) control without any knowledge of the system parameters.
The controls feature a decoupling mechanism and pre-feedback to overcome computational difficulties of Q-learning, and the output Q-learning does not use discounting factors in the cost function to guarantee the closed-loop stability. The environment is also a physical component of a CPS: a wind field, for example, modulates the system dynamics but may be unknown. The paper by He et al. studies minimum time-energy path planning in continuous state and control input spaces subject to an unknown environment. The authors design an approximate cost function to capture both the minimum time-energy objective and actuation constraints, based on which an integral reinforcement learning (IRL)-based optimal control is developed without knowledge of the environmental disturbance. Convergence of the IRL-based control is proven. Safety is another critical concern for CPS learning-based controls. In the paper titled “Safe reinforcement learning: A control barrier function optimization approach,” the authors Marvi and Kiumarsi design a safe reinforcement learning scheme that achieves both safety and optimal control performance. This is achieved through a design that incorporates a control barrier function into the optimal control cost function without affecting the stability and optimality within the safe region. An off-policy RL is developed to learn an optimal safe policy without complete knowledge of the system dynamics. In order to address both the performance optimization objective and the disturbance rejection objective, the authors Yang et al. study a mixed H2/H∞ performance optimization problem for general nonlinear systems with polynomial dynamics. The problem is formulated as a nonzero-sum game, and a policy iteration-based framework is developed using the Hamiltonian inequality.
A relaxed mixed sum-of-squares based iterative algorithm is then developed for the optimization problem, which includes both a policy improvement step for the H2 performance and a policy guarantee step for the H∞ performance. In the paper titled “Deep Koopman model predictive control for enhancing transient stability in power grids,” the authors Ping et al. develop a data-driven control framework to address the challenge of nonlinear complexity in power grids. A deep neural network (DNN)-based approximate Koopman operator is used to map the original nonlinear grid dynamics into a finite-dimensional linear space. A model predictive control strategy is then developed to enhance the transient stability of power grids through smartly utilizing energy storage systems in the presence of faults. The paper “Event-triggered distributed model predictive control for resilient voltage control of an islanded microgrid” by Ge et al. addresses the problem of distributed secondary voltage control of a microgrid in the islanded mode. An event-triggered distributed model predictive control scheme is designed for voltage regulation with reduced communication and computation loads subject to communication failures. A finite-time adaptive non-asymptotic observer is also designed to address the nonlinear dynamics and to facilitate the output feedback control. Automated demand response (ADR) is used to automatically control customer power consumption. In the paper titled “Stochastic modeling and scalable predictive control for automated demand response,” Kobayashi and Hiraishi use Markov chains to capture the complex behavior of power consumption and formulate the ADR problem as model predictive control. To solve the control problem, a mixed integer linear programming solution is developed to choose a control strategy from a finite set. The method scales with the number of consumers.
Uncertainties are common to CPS and especially human-CPS, where uncertain human intentions need to be learned. Expert-based ensemble learning algorithms can learn unknown probability distributions online. The paper by Young et al. develops evaluation metrics for N-expert ensemble learning algorithms named adaptiveness and consistency. Markov chain analysis is adopted to obtain quantitative relationships between mean hitting time, adaptiveness, and consistency for three different ensemble learning algorithms. Human-robot interaction studies are conducted to validate the analysis. In the paper titled “Expectation maximization based sparse identification of cyber physical system,” the authors Tang and Dong address the identification of hybrid nonlinear CPS models. A two-stage identification algorithm is developed that uses expectation maximization to identify all subsystems in the first step and then discovers the transition logic between subsystems using sparse logistic regression. Hybrid system examples are studied to demonstrate the robustness of the identification approach. In the paper “Stationary target localization and circumnavigation by a non-holonomic differentially driven mobile robot: Algorithms and experiments,” the authors Wang et al. consider the problem of circumnavigation when the target location is unknown and bearing-only measurements are available. After output feedback linearization, a two-step control algorithm is applied to the dynamics, including target location estimation and circumnavigation. Estimation and trajectory errors in both steps of the control are proven to converge, and the control is also verified by experimental and simulation studies. The paper by Battilotti et al. studies the distributed infinite-horizon LQG optimal control for networked continuous-time systems. A distributed solution is developed when only local information of the network is available to nodes.
This is achieved through first designing a distributed LQG that depends on network information and then equipping it with a Push-Sum algorithm to compute network information in a distributed manner. The proposed control performance is proven to be arbitrarily close to the centralized case. The paper by Chen et al. focuses on the finite-time consensus of second-order multiagent systems with both input saturation and disturbances, and develops distributed controllers using relative position and relative velocity measurements in both leader-following and leaderless cases. A continuous integral sliding mode method is designed to deal with bounded disturbances. The controller guarantees that the system remains in the sliding mode from any initial state regardless of disturbances and that finite-time consensus can be achieved. The paper by Wang et al. studies distributed sliding mode control for leader-follower formation flight of fixed-wing unmanned aerial vehicles (UAVs) subject to velocity constraints. A distributed sliding mode control law is developed for each UAV under a directed communication graph. The Lyapunov theory is used to prove that the error dynamics converge to the sliding mode surface and then to the origin in finite time. Formation can be achieved without requiring the adjustable range of follower linear velocity. The Guest Editors would like to thank the Editorial Office and the Editor-in-Chief of the International Journal of Robust and Nonlinear Control, Prof. Michael Grimble, for their support of this Special Issue. Wan and Lewis would also like to thank NSF grants 1714519 and 1839804 for the support of this work. In addition, we thank all the authors who submitted their quality papers, and special thanks go to all the anonymous reviewers for their time and effort in accomplishing the review tasks.

  • Book Chapter
  • Cited by: 1
  • 10.1007/978-3-030-82153-1_36
Lane Keeping Algorithm for Autonomous Driving via Safe Reinforcement Learning
  • Jan 1, 2021
  • Qi Kong + 2 more

Autonomous driving could lighten the load of transportation and change how we travel in everyday life. Work is being done to create algorithms for decision making and motion control in autonomous driving. Recently, reinforcement learning has been a predominant strategy applied for this purpose. However, the problems with using reinforcement learning for autonomous driving are that the actions taken during exploration can be unsafe and that convergence can be too slow. Therefore, before making an actual vehicle learn to drive through reinforcement learning, there is an urgent need to solve the safety issue. The significance of this paper is that it introduces Safe Reinforcement Learning (SRL) into the field of autonomous driving. Safe reinforcement learning is the method of adding constraints to ensure safe exploration. This paper explores the Constrained Policy Optimization (CPO) algorithm, whose principle is to introduce constraints in the cost function. CPO builds on the actor-critic framework: the region explored during each policy update is restricted by strict constraints that limit the size of the update. A comparison is also made with typical reinforcement learning algorithms to demonstrate its advantages in learning efficiency and safety.

  • Book Chapter
  • 10.1007/978-3-031-22216-0_5
Evaluation of Safe Reinforcement Learning with CoMirror Algorithm in a Non-Markovian Reward Problem
  • Jan 1, 2023
  • Megumi Miyashita + 2 more

In reinforcement learning, an agent improves its skill based on a reward, the feedback from its environment. For practical use, reinforcement learning faces several important challenges. First, reinforcement learning algorithms often rely on assumptions about the environment, such as Markov decision processes; however, real-world environments often cannot be represented by these assumptions. In particular, we focus on environments with non-Markovian rewards, in which the reward may depend on past experiences. To handle non-Markovian rewards, researchers have used a reward machine, which decomposes the original task into sub-tasks; in those works, the sub-tasks are usually assumed to be representable as Markov decision processes. Second, safety is another challenge in reinforcement learning. G-CoMDS is a safe reinforcement learning algorithm based on the CoMirror algorithm, a method for constrained optimization problems; we developed G-CoMDS to learn safely in environments that are not Markov decision processes. A promising approach in complex situations would therefore be to decompose the original task as a reward machine does, and then solve the sub-tasks with G-CoMDS. In this paper, we provide additional experimental results and discussion of G-CoMDS as a preliminary step toward combining G-CoMDS with a reward machine. We evaluate G-CoMDS and an existing safe reinforcement learning algorithm in a mobile robot simulation with a kind of non-Markovian reward. The experimental results show that G-CoMDS suppresses cost spikes and slightly exceeds the performance of the existing safe reinforcement learning algorithm.
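A reward machine makes a non-Markovian reward Markovian by tracking history in a small automaton. The sketch below is a generic illustration of that idea, not the machine used in the paper; the states, events, and task ("see `a`, then later `b`") are hypothetical.

```python
# Hypothetical reward machine: reward 1.0 is issued only when event "a"
# has been observed and event "b" occurs afterwards. The reward depends
# on history, so it is non-Markovian in the environment state alone,
# but Markovian once the machine state is appended.
REWARD_MACHINE = {
    ("u0", "a"): ("u1", 0.0),  # "a" seen; waiting for "b"
    ("u0", "b"): ("u0", 0.0),  # "b" before "a" earns nothing
    ("u1", "a"): ("u1", 0.0),
    ("u1", "b"): ("u2", 1.0),  # sub-task sequence complete
    ("u2", "a"): ("u2", 0.0),  # terminal: no further reward
    ("u2", "b"): ("u2", 0.0),
}

def run(events, state="u0"):
    """Feed an event sequence through the machine; return (state, reward)."""
    total = 0.0
    for e in events:
        state, r = REWARD_MACHINE[(state, e)]
        total += r
    return state, total
```

Each machine state (u0, u1, u2) corresponds to a sub-task that is itself Markovian, which is exactly the decomposition that would let a constrained learner such as G-CoMDS be applied per sub-task.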

  • Research Article
  • Cited by 3
  • 10.1080/09540091.2022.2151567
Safe reinforcement learning for dynamical systems using barrier certificates
  • Dec 12, 2022
  • Connection Science
  • Qingye Zhao + 2 more

Safety is a fundamental concern in policy design. Basic reinforcement learning is effective at learning policies with the goal-reaching property, but it does not guarantee the safety of the learned policy. This paper integrates barrier certificates into actor-critic-based reinforcement learning in a feedback-driven framework to learn safe policies for dynamical systems. The safe reinforcement learning framework consists of two interacting parts: a Learner and a Verifier. The Learner trains the policy to satisfy the goal-reaching and safety properties. Since the policy is trained on finite training datasets, the two properties may not hold on the whole system, so the Verifier validates the learned policy on the whole system. If validation fails, the Verifier returns counterexamples to the Learner for retraining the policy in the next iteration. We implement a safe policy learning tool, SRLBC, and evaluate its performance on three control tasks. Experimental results show that SRLBC achieves safety with no more than 0.5× time overhead compared to the baseline reinforcement learning method, demonstrating the feasibility and effectiveness of our framework.
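The Learner–Verifier feedback loop can be sketched on a scalar toy system. This is only an illustration of the counterexample-guided structure under invented assumptions: brute-force simulation over a grid stands in for the paper's barrier-certificate verification, the plant and "retraining" rule are hypothetical, and the safe set is simply |x| ≤ 1.

```python
def simulate(k, x0, steps=100, dt=0.1):
    """Roll out the unstable plant x' = x + u under the policy u = -k*x.
    Return True iff the trajectory stays inside the safe set |x| <= 1."""
    x = x0
    for _ in range(steps):
        x += dt * (x - k * x)
        if abs(x) > 1.0:
            return False
    return True

def verifier(k, grid):
    """Stand-in Verifier: check the policy on a grid of initial states
    and return the counterexamples (unsafe starting points)."""
    return [x0 for x0 in grid if not simulate(k, x0)]

def learner_verifier_loop(k=0.0, max_iters=50):
    grid = [i / 10 for i in range(-5, 6)]  # initial set |x0| <= 0.5
    for _ in range(max_iters):
        cex = verifier(k, grid)
        if not cex:
            return k  # no counterexamples: accepted as safe on the grid
        k += 0.5      # stand-in "retraining": strengthen the feedback gain
    return None       # failed to find a safe policy within the budget

k_safe = learner_verifier_loop()
```

The loop terminates the first time the Verifier finds no counterexamples; in SRLBC the Verifier's certificate covers the whole state space rather than a finite grid, which is what upgrades "no counterexamples found" to an actual safety guarantee.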

  • Research Article
  • Cited by 8
  • 10.1007/s10015-019-00523-3
Multi-objective safe reinforcement learning: the relationship between multi-objective reinforcement learning and safe reinforcement learning
  • Feb 8, 2019
  • Artificial Life and Robotics
  • Naoto Horie + 4 more

Reinforcement learning (RL) is a learning method that learns actions through trial and error. Recently, multi-objective reinforcement learning (MORL) and safe reinforcement learning (SafeRL) have been studied. Conventional RL aims to maximize expected rewards, but this can lead to fatal states because safety is not considered; RL methods that account for safety during or after learning have therefore been proposed. SafeRL is similar to MORL in that it considers two objectives: maximizing expected rewards and satisfying safety constraints. However, to the best of our knowledge, no study has investigated the relationship between MORL and SafeRL to demonstrate that SafeRL methods can be applied to MORL tasks. This paper combines MORL with SafeRL and proposes a method for multi-objective SafeRL (MOSafeRL). We applied the proposed method to the resource gathering task, a standard task used in MORL test cases.
