Pure-Past Action Masking

Abstract

We present Pure-Past Action Masking (PPAM), a lightweight approach to action masking for safe reinforcement learning. In PPAM, actions are disallowed (“masked”) according to specifications expressed in Pure-Past Linear Temporal Logic (PPLTL). PPAM can enforce non-Markovian constraints, i.e., constraints based on the history of the system, rather than just the current state of the (possibly hidden) MDP. The features used in the safety constraint need not be the same as those used by the learning agent, allowing a clear separation of concerns between the safety constraints and the reward specifications of the (learning) agent. We prove formally that an agent trained with PPAM can learn any optimal policy that satisfies the safety constraints, and that PPAM is as expressive as shields, another approach to enforcing non-Markovian constraints in RL. Finally, we provide empirical results showing how PPAM can guarantee constraint satisfaction in practice.
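The core idea above can be illustrated with a small sketch. This is not the paper's implementation: the constraint, the monitor class, and the action names are all hypothetical. The monitored pure-past property is "FIRE is allowed only if RELOAD has occurred since the last FIRE", which is non-Markovian because it depends on the action history rather than the current MDP state; the monitor keeps just enough of that history to compute the mask.

```python
# Minimal sketch of history-dependent action masking in the spirit of PPAM.
# Hypothetical constraint, actions, and API -- not the paper's code.

FIRE, RELOAD, MOVE = "fire", "reload", "move"
ACTIONS = [FIRE, RELOAD, MOVE]

class PurePastMonitor:
    """Tracks just enough history to evaluate the pure-past formula:
    'FIRE only if RELOAD has occurred since the last FIRE'."""
    def __init__(self):
        self.loaded = True  # assume the agent starts "loaded"

    def step(self, action):
        # Update the monitor state from the action actually taken.
        if action == FIRE:
            self.loaded = False
        elif action == RELOAD:
            self.loaded = True

    def mask(self):
        # Disallow any action that would violate the constraint now.
        return [a for a in ACTIONS if a != FIRE or self.loaded]

# The learning agent only ever picks from monitor.mask(); note the mask
# changes over time even if the MDP state does not.
monitor = PurePastMonitor()
for action in [FIRE, MOVE, RELOAD, FIRE]:
    assert action in monitor.mask()
    monitor.step(action)
```

The separation of concerns claimed in the abstract is visible here: the monitor sees only actions (it could equally watch dedicated safety features), while the agent's own state and reward representation never enter the masking logic.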

Similar Papers
  • Research Article
  • 10.1609/aaai.v39i15.33767
Probabilistic Shielding for Safe Reinforcement Learning
  • Apr 11, 2025
  • Proceedings of the AAAI Conference on Artificial Intelligence
  • Edwin Hamel-De Le Court + 2 more

In real-life scenarios, a Reinforcement Learning (RL) agent aiming to maximize its reward must often also behave in a safe manner, including at training time. Thus, much attention in recent years has been given to Safe RL, where an agent aims to learn an optimal policy among all policies that satisfy a given safety constraint. However, strict safety guarantees are often provided through approaches based on linear programming, which scale poorly. In this paper we present a new, scalable method that enjoys strict formal guarantees for Safe RL, in the case where the safety dynamics of the Markov Decision Process (MDP) are known and safety is defined as an undiscounted probabilistic avoidance property. Our approach is based on state augmentation of the MDP and on the design of a shield that restricts the actions available to the agent. We show that our approach provides a strict formal guarantee that the agent stays safe at training and test time. Furthermore, we demonstrate that our approach is viable in practice through experimental evaluation.

  • Conference Article
  • Citations: 14
  • 10.1109/icra48506.2021.9561593
Context-Aware Safe Reinforcement Learning for Non-Stationary Environments
  • May 30, 2021
  • Baiming Chen + 6 more

Safety is a critical concern when deploying reinforcement learning agents for realistic tasks. Recently, safe reinforcement learning algorithms have been developed to optimize the agent's performance while avoiding violations of safety constraints. However, few studies have addressed the non-stationary disturbances in the environments, which may cause catastrophic outcomes. In this paper, we propose the context-aware safe reinforcement learning (CASRL) method, a meta-learning framework to realize safe adaptation in non-stationary environments. We use a probabilistic latent variable model to achieve fast inference of the posterior environment transition distribution given the context data. Safety constraints are then evaluated with uncertainty-aware trajectory sampling. Prior safety constraints are formulated with domain knowledge to improve safety during exploration. The algorithm is evaluated in realistic safety-critical environments with non-stationary disturbances. Results show that the proposed algorithm significantly outperforms existing baselines in terms of safety and robustness.

  • Conference Article
  • Citations: 3
  • 10.1109/icoac44903.2018.8939117
Double Q–learning Agent for Othello Board Game
  • Dec 1, 2018
  • Thamarai Selvi Somasundaram + 4 more

This paper presents the first application of the Double Q–learning algorithm to the game of Othello. Reinforcement learning has previously been successfully applied to Othello using the canonical reinforcement learning algorithms, Q–learning and TD–learning. However, these algorithms suffer from considerable drawbacks. Q–learning frequently tends to be overoptimistic during evaluation, while TD–learning can get stuck in local minima. To overcome the disadvantages of the existing work, we propose using a Double Q–learning agent to play Othello and show that it performs better than the existing learning agents. In addition to developing and implementing the Double Q–learning agent, we implement the Q–learning and TD–learning agents. The agents are trained and tested against two fixed opponents: a random player and a heuristic player. The performance of the Double Q–learning agent is compared with that of the existing learning agents. The Double Q–learning agent outperforms them, although it takes longer, on average, to make each move. Further, we show that the Double Q–learning agent performs at its best with two hidden layers using the tanh function.
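The overestimation fix the abstract alludes to comes from maintaining two independent estimators and decoupling action selection from action evaluation. A minimal tabular sketch follows; the toy self-loop environment in the usage is illustrative, not the paper's Othello setup.

```python
# Minimal tabular Double Q-learning update (van Hasselt, 2010) of the kind
# the paper applies to Othello; the environment here is a toy illustration.
import random
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.99
QA = defaultdict(float)   # first estimator
QB = defaultdict(float)   # second estimator
ACTIONS = [0, 1]

def double_q_update(s, a, r, s_next):
    """Select the greedy next action with one estimator, but evaluate it
    with the other; this decoupling curbs the max-operator's optimism."""
    if random.random() < 0.5:
        a_star = max(ACTIONS, key=lambda a2: QA[(s_next, a2)])
        target = r + GAMMA * QB[(s_next, a_star)]
        QA[(s, a)] += ALPHA * (target - QA[(s, a)])
    else:
        b_star = max(ACTIONS, key=lambda a2: QB[(s_next, a2)])
        target = r + GAMMA * QA[(s_next, b_star)]
        QB[(s, a)] += ALPHA * (target - QB[(s, a)])

def policy(s):
    # Act greedily on the sum of both estimators.
    return max(ACTIONS, key=lambda a: QA[(s, a)] + QB[(s, a)])

# Toy usage: in a single self-looping state, action 1 yields reward 1 and
# action 0 yields 0; the learned policy should come to prefer action 1.
random.seed(0)
for _ in range(500):
    double_q_update("s", 1, 1.0, "s")
    double_q_update("s", 0, 0.0, "s")
```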

  • Research Article
  • Citations: 5
  • 10.1016/j.promfg.2020.11.013
Self-adaptive Traffic and Logistics Flow Control using Learning Agents and Ubiquitous Sensors
  • Jan 1, 2020
  • Procedia Manufacturing
  • Stefan Bosse


  • Conference Article
  • Citations: 3
  • 10.24963/ijcai.2021/347
Model-Based Reinforcement Learning for Infinite-Horizon Discounted Constrained Markov Decision Processes
  • Aug 1, 2021
  • Aria Hasanzadezonuzy + 2 more

In many real-world reinforcement learning (RL) problems, in addition to maximizing the objective, the learning agent has to maintain some necessary safety constraints. We formulate the problem of learning a safe policy as an infinite-horizon discounted Constrained Markov Decision Process (CMDP) with an unknown transition probability matrix, where the safety requirements are modeled as constraints on expected cumulative costs. We propose two model-based constrained reinforcement learning (CRL) algorithms for learning a safe policy, namely, (i) the GM-CRL algorithm, where the algorithm has access to a generative model, and (ii) the UC-CRL algorithm, where the algorithm learns the model using an upper confidence style online exploration method. We characterize the sample complexity of these algorithms, i.e., the number of samples needed to ensure a desired level of accuracy with high probability, both with respect to objective maximization and constraint satisfaction.

  • Conference Article
  • Citations: 4
  • 10.1109/icsmc.2006.384788
An Efficient Multi-Agent Q-learning Method Based on Observing the Adversary Agent State Change
  • Oct 1, 2006
  • Ruoying Sun + 1 more

For tasks modeled as Markov decision processes, this paper presents a novel multi-agent reinforcement learning method based on observing the adversary agent's state changes. By observing the adversary agent's state changes and treating them as the learning agents' observations of the environment, the learning agents extend their learning episodes and derive more observations from fewer actions. In the extreme, the learning agents can treat the adversary agent's state changes as their own exploration policy, which allows them to rely on exploitation to derive maximal reward during learning. Further, by discussing how the learning agents cooperate through both direct communication and indirect media communication, the paper also describes the inexpensive features of both communication methods used in the proposed learning method. Direct communication enhances the learning agents' ability to observe the task environment, while indirect media communication helps them derive the optimal action policy efficiently. Simulation results on the hunter game demonstrate the efficiency of the proposed method.

  • Research Article
  • Citations: 8
  • 10.1016/j.ifacol.2023.10.1420
Machine Learning Agents Augmented by Digital Twinning for Smart Production Scheduling
  • Jan 1, 2023
  • IFAC PapersOnLine
  • Kosmas Alexopoulos + 4 more


  • Research Article
  • Citations: 1
  • 10.1109/tnnls.2024.3496492
GenSafe: A Generalizable Safety Enhancer for Safe Reinforcement Learning Algorithms Based on Reduced Order Markov Decision Process Model.
  • Jun 1, 2025
  • IEEE transactions on neural networks and learning systems
  • Zhehua Zhou + 4 more

Safe reinforcement learning (SRL) aims to realize a safe learning process for deep reinforcement learning (DRL) algorithms by incorporating safety constraints. However, the efficacy of SRL approaches often relies on accurate function approximations, which are notably challenging to achieve in the early learning stages due to data insufficiency. To address this issue, we introduce, in this work, a novel generalizable safety enhancer (GenSafe) that can overcome the challenge of data insufficiency and enhance the performance of SRL approaches. Leveraging model order reduction techniques, we first propose an innovative method to construct a reduced order Markov decision process (ROMDP) as a low-dimensional approximator of the original safety constraints. Then, by solving the reformulated ROMDP-based constraints, GenSafe refines the actions of the agent to increase the possibility of constraint satisfaction. Essentially, GenSafe acts as an additional safety layer for SRL algorithms. We evaluate GenSafe on multiple SRL approaches and benchmark problems. The results demonstrate its capability to improve safety performance, especially in the early learning phases, while maintaining satisfactory task performance. Our proposed GenSafe not only offers a novel measure to augment existing SRL methods but also shows broad compatibility with various SRL algorithms, making it applicable to a wide range of systems and SRL problems.

  • Conference Article
  • 10.1109/icsmc.2001.973538
Increasing the flexibility and speed of convergence of a learning agent
  • Dec 1, 2001
  • M.A.S Santibanez + 1 more

A review of the basic methods used to model a learning agent, such as instance-based learning, artificial neural networks and reinforcement learning, suggests that they either lack flexibility (can only be used to solve a small number of problems) or they tend to converge very slowly to the optimal policy. This paper describes and illustrates a set of processes that address these two shortcomings. The resulting learning agent is able to adapt fairly well to a much larger set of environments and is capable of doing this in a reasonable amount of time. In order to address the lack of flexibility and slow convergence to the optimal policy, the new learning agent becomes a hybrid between a learning agent based on instance-based learning and one based on reinforcement learning. To accelerate its convergence to its optimal policy, this new learning agent incorporates the use of a new concept we call propagation of good findings. Furthermore, to make better use of the learning agent's memory resources, and therefore increase its flexibility, we make use of another new concept we call moving prototypes.

  • Conference Article
  • Citations: 18
  • 10.1109/icassp.2019.8682983
Reinforcement Learning with Safe Exploration for Network Security
  • May 1, 2019
  • Canhuang Dai + 3 more

Safe reinforcement learning is important for safety-critical applications, especially network security, as the exploration of some dangerous actions can result in huge short-term losses such as network failure or large-scale privacy leakage. In this paper, we propose a reinforcement learning algorithm with safe exploration that uses transfer learning to reduce the initial random exploration. A blacklist is maintained to record the most dangerous state-action pairs as a safety constraint. A safe deep reinforcement learning version uses a convolutional neural network to estimate the risk levels and thus further improves the safety of the exploration and accelerates the learning speed for the learning agent. As a case study, the proposed reinforcement learning with safe exploration is applied to anti-jamming robot communications. Experimental results show that the proposed algorithms can improve the jamming resistance of the robot and reduce the rate at which the agent enters the most dangerous states compared with the benchmark algorithms.
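The blacklist mechanism described above can be sketched as an exploration wrapper. The class name, loss threshold, and epsilon-greedy wiring here are assumptions for exposition, not the paper's implementation: the key point is simply that exploration is restricted to state-action pairs not yet recorded as dangerous.

```python
# Illustrative blacklist-based safe exploration wrapper (names, thresholds,
# and the epsilon-greedy policy are assumptions, not the paper's code).
import random

class BlacklistExplorer:
    def __init__(self, actions, loss_threshold=-10.0, epsilon=0.1):
        self.actions = actions
        self.blacklist = set()          # most dangerous (state, action) pairs
        self.loss_threshold = loss_threshold
        self.epsilon = epsilon

    def safe_actions(self, state):
        allowed = [a for a in self.actions
                   if (state, a) not in self.blacklist]
        return allowed or self.actions  # never leave the agent with no action

    def choose(self, state, q):
        """Epsilon-greedy, but both exploration and exploitation are
        confined to the non-blacklisted actions."""
        allowed = self.safe_actions(state)
        if random.random() < self.epsilon:
            return random.choice(allowed)
        return max(allowed, key=lambda a: q.get((state, a), 0.0))

    def record(self, state, action, reward):
        # A large short-term loss marks the pair as dangerous.
        if reward <= self.loss_threshold:
            self.blacklist.add((state, action))
```

Note the contrast with the PPAM paper above: the blacklist is learned from observed losses and is state-dependent, whereas PPAM masks are derived ahead of time from a logical specification over the history.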

  • Research Article
  • Citations: 1
  • 10.1609/aaai.v38i19.30094
P2BPO: Permeable Penalty Barrier-Based Policy Optimization for Safe RL
  • Mar 24, 2024
  • Proceedings of the AAAI Conference on Artificial Intelligence
  • Sumanta Dey + 2 more

Safe Reinforcement Learning (SRL) algorithms aim to learn a policy that maximizes the reward while satisfying the safety constraints. One of the challenges in SRL is that it is often difficult to balance the two objectives of reward maximization and safety constraint satisfaction. Existing algorithms utilize constraint optimization techniques such as penalty-based, barrier penalty-based, and Lagrangian-based dual or primal policy optimization methods. However, they suffer from training oscillations and approximation errors, which impact the overall learning objectives. This paper proposes the Permeable Penalty Barrier-based Policy Optimization (P2BPO) algorithm, which addresses this issue by allowing a small fraction of the penalty to pass beyond the penalty barrier, with a parameter controlling this permeability. In addition, an adaptive penalty parameter is used instead of a constant one; it is initialized with a low value and increased gradually as the agent violates the safety constraints. We have also provided a theoretical proof of the proposed method's performance guarantee bound, which ensures that P2BPO can learn a policy satisfying the safety constraints with high probability while achieving a higher expected reward. Furthermore, we compare P2BPO with other SRL algorithms on various SRL tasks and demonstrate that it achieves better rewards while adhering to the constraints.
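The two ingredients named in the abstract, a permeability parameter and an adaptive penalty coefficient, can be sketched as follows. The update rule, function names, and default values are invented for illustration; the paper's actual formulation may differ substantially.

```python
# Illustrative sketch of a permeable barrier penalty with an adaptive
# coefficient (hypothetical update rule; not P2BPO's exact formulation).

def permeable_barrier_penalty(cost, budget, permeability=0.05):
    """Penalize constraint cost beyond the budget, but let a small
    fraction (the permeability) pass un-penalized."""
    return max(0.0, cost - budget * (1.0 + permeability))

class AdaptivePenalty:
    """Start with a low coefficient and grow it on each violation,
    mirroring the gradual-increase schedule the abstract describes."""
    def __init__(self, init=0.1, growth=1.5, cap=100.0):
        self.coef = init
        self.growth = growth
        self.cap = cap

    def update(self, violated):
        if violated:
            self.coef = min(self.coef * self.growth, self.cap)
        return self.coef

def penalized_objective(reward, cost, budget, penalty):
    # Reward maximization minus the (permeable, adaptively weighted) penalty.
    return reward - penalty.coef * permeable_barrier_penalty(cost, budget)
```

Starting the coefficient low keeps early training close to unconstrained reward maximization, while the permeability margin avoids the steep gradients a hard barrier would impose right at the constraint boundary.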

  • Conference Article
  • 10.1145/3461778.3462029
Hammers for Robots: Designing Tools for Reinforcement Learning Agents
  • Jun 28, 2021
  • Matthew V Law + 5 more

In this paper we explore what role humans might play in designing tools for reinforcement learning (RL) agents to interact with the world. Recent work has explored RL methods that optimize a robot’s morphology while learning to control it, effectively dividing an RL agent’s environment into the external world and the agent’s interface with the world. Taking a user-centered design (UCD) approach, we explore the potential of a human, instead of an algorithm, redesigning the agent’s tool. Using UCD to design for a machine learning agent brings up several research questions, including what it means to understand an RL agent’s experience, beliefs, tendencies, and goals. After discussing these questions, we then present a system we developed to study humans designing a 2D racecar for an RL autonomous driver. We conclude with findings and insights from exploratory pilots with twelve users using this system.

  • Research Article
  • Citations: 4
  • 10.1109/tpami.2024.3443916
Safe Reinforcement Learning With Dual Robustness.
  • Dec 1, 2024
  • IEEE transactions on pattern analysis and machine intelligence
  • Zeyang Li + 4 more

Reinforcement learning (RL) agents are vulnerable to adversarial disturbances, which can deteriorate task performance or break down safety specifications. Existing methods either address safety requirements under the assumption of no adversary (e.g., safe RL) or only focus on robustness against performance adversaries (e.g., robust RL). Learning one policy that is both safe and robust under any adversaries remains a challenging open problem. The difficulty is how to tackle two intertwined aspects in the worst cases: feasibility and optimality. The optimality is only valid inside a feasible region (i.e., robust invariant set), while the identification of the maximal feasible region must rely on how to learn the optimal policy. To address this issue, we propose a systematic framework to unify safe RL and robust RL, including the problem formulation, iteration scheme, convergence analysis and practical algorithm design. The unification is built upon constrained two-player zero-sum Markov games, in which the objective for the protagonist is twofold. For states inside the maximal robust invariant set, the goal is to pursue rewards under the condition of guaranteed safety; for states outside the maximal robust invariant set, the goal is to reduce the extent of constraint violation. A dual policy iteration scheme is proposed, which simultaneously optimizes a task policy and a safety policy. We prove that the iteration scheme converges to the optimal task policy which maximizes the twofold objective in the worst cases, and the optimal safety policy which stays as far away from the safety boundary as possible. The convergence of the safety policy is established by exploiting the monotone contraction property of safety self-consistency operators, and that of the task policy depends on the transformation of safety constraints into state-dependent action spaces.
By adding two adversarial networks (one is for safety guarantee and the other is for task performance), we propose a practical deep RL algorithm for constrained zero-sum Markov games, called dually robust actor-critic (DRAC). The evaluations with safety-critical benchmarks demonstrate that DRAC achieves high performance and persistent safety under all scenarios (no adversary, safety adversary, performance adversary), outperforming all baselines by a large margin.

  • Research Article
  • Citations: 116
  • 10.1609/aaai.v32i1.12107
Safe Reinforcement Learning via Formal Methods: Toward Safe Control Through Proof and Learning
  • Apr 26, 2018
  • Proceedings of the AAAI Conference on Artificial Intelligence
  • Nathan Fulton + 1 more

Formal verification provides a high degree of confidence in safe system operation, but only if reality matches the verified model. Although a good model will be accurate most of the time, even the best models are incomplete. This is especially true in Cyber-Physical Systems because high-fidelity physical models of systems are expensive to develop and often intractable to verify. Conversely, reinforcement learning-based controllers are lauded for their flexibility in unmodeled environments, but do not provide guarantees of safe operation. This paper presents an approach for provably safe learning that provides the best of both worlds: the exploration and optimization capabilities of learning along with the safety guarantees of formal verification. Our main insight is that formal verification combined with verified runtime monitoring can ensure the safety of a learning agent. Verification results are preserved whenever learning agents limit exploration within the confines of verified control choices as long as observed reality comports with the model used for off-line verification. When a model violation is detected, the agent abandons efficiency and instead attempts to learn a control strategy that guides the agent to a modeled portion of the state space. We prove that our approach toward incorporating knowledge about safe control into learning systems preserves safety guarantees, and demonstrate that we retain the empirical performance benefits provided by reinforcement learning. We also explore various points in the design space for these justified speculative controllers in a simple adaptive cruise control model for autonomous cars.

  • Research Article
  • Citations: 3
  • 10.1002/acs.3326
Assured learning‐enabled autonomy: A metacognitive reinforcement learning framework
  • Sep 6, 2021
  • International Journal of Adaptive Control and Signal Processing
  • Aquib Mustafa + 3 more

Reinforcement learning (RL) agents with pre‐specified reward functions cannot provide guaranteed safety across the variety of circumstances that an uncertain system might encounter. To guarantee performance while assuring satisfaction of safety constraints across a variety of circumstances, an assured autonomous control framework is presented in this article by empowering RL algorithms with metacognitive learning capabilities. More specifically, the reward function parameters of the RL agent are adapted in a metacognitive decision‐making layer to assure the feasibility of the RL agent, that is, to assure that the policy learned by the RL agent satisfies safety constraints specified by signal temporal logic while achieving as much performance as possible. The metacognitive layer monitors any possible future safety violation under the actions of the RL agent and employs a higher‐layer Bayesian RL algorithm to proactively adapt the reward function for the lower‐layer RL agent. To minimize the higher‐layer Bayesian RL intervention, a fitness function is leveraged by the metacognitive layer as a metric to evaluate the success of the lower‐layer RL agent in satisfying safety and liveness specifications, and the higher‐layer Bayesian RL intervenes only if there is a risk of lower‐layer RL failure. Finally, a simulation example is provided to validate the effectiveness of the proposed approach.
