Online Reward Poisoning in Reinforcement Learning with Convergence Guarantee

Similar Papers
  • Research Article
  • 10.4271/01-18-01-0006
Fault-Tolerant Control of a Quadcopter Using Reinforcement Learning
  • Mar 3, 2025
  • SAE International Journal of Aerospace
  • Muzaffar Habib Qureshi + 2 more

This study presents a novel reinforcement learning (RL)-based control framework aimed at enhancing the safety and robustness of a quadcopter, with a specific focus on resilience to in-flight single-propeller failure. It addresses the critical need for a robust control strategy that maintains the desired altitude of the quadcopter, protecting the hardware and the payload in physical applications. The proposed framework investigates two RL methodologies, dynamic programming (DP) and deep deterministic policy gradient (DDPG), to overcome the challenges posed by rotor failure. DP, a model-based approach, is leveraged for its convergence guarantees despite high computational demands, whereas DDPG, a model-free technique, enables rapid computation but with constraints on solution duration. The research challenge arises from training RL algorithms on large state and action domains. With modifications to the existing DP and DDPG algorithms, the controllers were trained not only to handle large continuous state and action domains but also to reach a desired state after an in-flight propeller failure. To verify the robustness of the proposed framework, extensive simulations were conducted in a MATLAB environment across various initial conditions, underscoring its viability for mission-critical quadcopter applications. A comparative analysis was performed between the two RL algorithms and their potential for application in faulty aerial systems.
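
The convergence guarantee this abstract attributes to DP comes from value iteration being a γ-contraction, so it converges from any starting point. A minimal sketch on a hypothetical 2-state MDP (made up for illustration, not the quadcopter model):

```python
# Value iteration on a toy 2-state MDP (illustrative, not the paper's model).
# Each Bellman backup contracts the sup-norm error by GAMMA, so the loop
# below terminates and the result is near the true optimal values.

GAMMA = 0.9
# P[s][a] = list of (probability, next_state, reward)
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
}

def value_iteration(tol=1e-8):
    V = {s: 0.0 for s in P}
    while True:
        V_new = {
            s: max(sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])
                   for a in P[s])
            for s in P
        }
        if max(abs(V_new[s] - V[s]) for s in P) < tol:
            return V_new
        V = V_new

V = value_iteration()
print(V)  # both states approach 1/(1 - 0.9) = 10 by always choosing action 1
```

DDPG, in contrast, replaces these exact backups with function approximation and gradient steps, which is what trades the guarantee for scalability.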

  • Research Article
  • Cited by 2
  • 10.1609/aaai.v36i6.20602
Distillation of RL Policies with Formal Guarantees via Variational Abstraction of Markov Decision Processes
  • Jun 28, 2022
  • Proceedings of the AAAI Conference on Artificial Intelligence
  • Florent Delgrange + 2 more

We consider the challenge of policy simplification and verification in the context of policies learned through reinforcement learning (RL) in continuous environments. In well-behaved settings, RL algorithms have convergence guarantees in the limit. While these guarantees are valuable, they are insufficient for safety-critical applications. Furthermore, they are lost when applying advanced techniques such as deep-RL. To recover guarantees when applying advanced RL algorithms to more complex environments with (i) reachability, (ii) safety-constrained reachability, or (iii) discounted-reward objectives, we build upon the DeepMDP framework to derive new bisimulation bounds between the unknown environment and a learned discrete latent model of it. Our bisimulation bounds enable the application of formal methods for Markov decision processes. Finally, we show how one can use a policy obtained via state-of-the-art RL to efficiently train a variational autoencoder that yields a discrete latent model with provably approximately correct bisimulation guarantees. Additionally, we obtain a distilled version of the policy for the latent model.

  • Conference Article
  • Cited by 2
  • 10.5555/3018874.3018878
A scalable parallel Q-Learning algorithm for resource constrained decentralized computing environments
  • Nov 13, 2016
  • Miguel Camelo + 2 more

The Internet of Things (IoT) is increasingly becoming a platform for mission-critical applications with stringent requirements in terms of response time and mobility. A centralized High Performance Computing (HPC) environment is therefore often unsuitable or simply non-existent. Instead, there is a need for a scalable HPC model that supports the deployment of applications on the decentralized but resource-constrained devices of the IoT. Recently, Reinforcement Learning (RL) algorithms have been used for decision making within applications by directly interacting with the environment. However, most RL algorithms are designed for centralized environments and are time- and resource-consuming, and are therefore not applicable to such constrained decentralized computing environments. In this paper, we propose a scalable Parallel Q-Learning (PQL) algorithm for resource-constrained environments. By combining a table-partition strategy with a co-allocation of both processing and storage, we can significantly reduce the individual resource cost while guaranteeing convergence and minimizing the communication cost. Experimental results show that our algorithm reduces the required training in proportion to the number of Q-Learning agents and, in terms of execution time, is up to 24 times faster than several well-known PQL algorithms.
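
The table-partition idea can be sketched minimally: each "agent" owns a disjoint shard of the Q-table, so an update only touches the shard that owns the state. The chain MDP, shard rule, and constants below are illustrative assumptions, not the paper's design:

```python
# Partitioned tabular Q-learning on a toy chain (illustrative only).
# States are assigned to shards by s % NUM_AGENTS; an update is routed
# to the owning shard, which is the core of the table-partition strategy.
import random

random.seed(0)
NUM_AGENTS, N_STATES, ACTIONS = 2, 6, (0, 1)   # action 1 = move right
ALPHA, GAMMA = 0.5, 0.9
shards = [dict() for _ in range(NUM_AGENTS)]   # one Q-table shard per agent

def q(s, a):
    return shards[s % NUM_AGENTS].get((s, a), 0.0)

def update(s, a, r, s2):
    owner = shards[s % NUM_AGENTS]             # route update to owning shard
    target = r + GAMMA * max(q(s2, b) for b in ACTIONS)
    owner[(s, a)] = q(s, a) + ALPHA * (target - q(s, a))

def step(s, a):
    s2 = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0)

for _ in range(2000):                          # epsilon-greedy episodes
    s = 0
    while s != N_STATES - 1:
        a = random.choice(ACTIONS) if random.random() < 0.2 else \
            max(ACTIONS, key=lambda b: q(s, b))
        s2, r = step(s, a)
        update(s, a, r, s2)
        s = s2

print(q(0, 1))  # approaches gamma**4 = 0.6561 under the optimal policy
```

In the real decentralized setting the routing crosses devices, which is where the paper's co-allocation of processing and storage keeps communication cost down.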

  • Research Article
  • Cited by 8
  • 10.1002/rnc.5624
Suboptimal reduced control of unknown nonlinear singularly perturbed systems via reinforcement learning
  • Jun 1, 2021
  • International Journal of Robust and Nonlinear Control
  • Xiaomin Liu + 4 more

In this paper, a suboptimal reduced control method is proposed for a class of nonlinear singularly perturbed systems (SPSs) with unknown dynamics. Using singular perturbation theory, the original system is reduced to a lower-order system, for which a policy iteration method is proposed to solve the corresponding reduced Hamilton–Jacobi–Bellman (HJB) equation with guaranteed convergence. A reinforcement learning (RL) algorithm is proposed to implement the policy iteration method without any knowledge of the system dynamics. In the RL algorithm, the unmeasurable state of the virtual reduced system is reconstructed from the slow state measurements of the original system, the controller and cost function are approximated by actor-critic neural networks (NNs), and the method of weighted residuals is used to update the NN weights. The influence of the state reconstruction error and NN function approximation on the convergence and suboptimality of the reduced controller and on the stability of the closed-loop SPSs is rigorously analyzed. Finally, the effectiveness of the proposed method is illustrated by examples.
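
The evaluate-then-improve structure of policy iteration, which the abstract relies on, can be illustrated on a tabular toy. The real method operates on a continuous HJB equation; this 2-state MDP is a made-up stand-in for the loop structure only:

```python
# Tabular policy iteration (illustrative stand-in for the HJB setting).
# Alternates policy evaluation (fix pi, compute V^pi) with greedy
# improvement until the policy is stable, which implies optimality.
GAMMA = 0.9
# deterministic toy MDP: NEXT[s][a] = next state, REW[s][a] = reward
NEXT = {0: {0: 0, 1: 1}, 1: {0: 0, 1: 1}}
REW  = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 1.0}}

def evaluate(pi, sweeps=200):
    V = {s: 0.0 for s in NEXT}
    for _ in range(sweeps):                      # iterative policy evaluation
        V = {s: REW[s][pi[s]] + GAMMA * V[NEXT[s][pi[s]]] for s in NEXT}
    return V

pi = {0: 0, 1: 0}                                # start from "always action 0"
while True:
    V = evaluate(pi)
    new_pi = {s: max(NEXT[s],
                     key=lambda a: REW[s][a] + GAMMA * V[NEXT[s][a]])
              for s in NEXT}
    if new_pi == pi:                             # policy stable => optimal
        break
    pi = new_pi

print(pi, V)  # optimal policy picks action 1 everywhere; values near 10
```

The paper's RL algorithm replaces the exact evaluation step with actor-critic NN approximation driven by measured data, which is what removes the need for a system model.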

  • Conference Article
  • Cited by 1
  • 10.1109/ijcnn55064.2022.9892099
Intelligent Robust Control for Second-Order Non-Linear Systems with Smart Gain Tuning based on Reinforcement Learning
  • Jul 18, 2022
  • Adalberto I S Oliveira + 2 more

Reinforcement Learning (RL) is a machine learning technique that deals with linear and nonlinear systems without necessarily knowing their exact dynamic models. It can learn to bring the system states directly and quickly to any reachable point, but has no convergence guarantees. On the other hand, the Sliding Mode Control (SMC) approach is a robust control technique based on variable structure systems that can handle parametric uncertainties and external disturbances, provably driving system trajectories to the vicinity of the origin. However, tuning the control gains and the sliding surface parameters is not straightforward. This work presents a methodology for merging RL and SMC approaches in a unified intelligent control technique to ensure system stability and robustness to perturbations and modeling inaccuracies. In contrast to previous work, the control gains are tuned using the same technique as the RL actor. The newly proposed controller is applied to the swing up and stabilization of the inverted pendulum system for validation purposes. Its performance is then compared to its component parts: Twin Delayed DDPG (TD3) and first-order SMC. Simulation results are presented to demonstrate the effectiveness and feasibility of the proposed methodology.

  • Conference Article
  • 10.1109/icarm52023.2021.9536204
Learning Smooth and Omnidirectional Locomotion for Quadruped Robots
  • Jul 3, 2021
  • Jiaxi Wu + 5 more

It often takes a great deal of trial and error for a quadruped robot to learn a proper and natural gait directly through reinforcement learning, and many attempts and clever reward settings are needed to learn appropriate locomotion; even so, the success rate of network convergence remains relatively low. In this paper, a reference trajectory, inverse kinematics, and a transformation loss are integrated into the reinforcement learning training process as prior knowledge. Reinforcement learning therefore only needs to search for the optimal solution around the reference trajectory, making it easier to find appropriate locomotion and guarantee convergence. At test time, a PD controller is fused into the trained model to reduce the velocity-following error. Based on these ideas, we propose two control frameworks, single closed-loop and double closed-loop, and demonstrate their effectiveness through experiments. The approach efficiently helps quadruped robots learn an appropriate gait and realize smooth, omnidirectional locomotion, all learned in one model.

  • Research Article
  • Cited by 98
  • 10.7939/r30q50
Gradient temporal-difference learning algorithms
  • Jan 1, 2011
  • Richard S Sutton + 1 more

We present a new family of gradient temporal-difference (TD) learning methods with function approximation whose complexity, both in terms of memory and per-time-step computation, scales linearly with the number of learning parameters. TD methods are powerful prediction techniques, and with function approximation form a core part of modern reinforcement learning (RL). However, the most popular TD methods, such as TD(λ), Q-learning and Sarsa, may become unstable and diverge when combined with function approximation. In particular, convergence cannot be guaranteed for these methods when they are used with off-policy training. Off-policy training—training on data from one policy in order to learn the value of another—is useful in dealing with the exploration-exploitation tradeoff. As function approximation is needed for large-scale applications, this stability problem is a key impediment to extending TD methods to real-world large-scale problems. The new family of TD algorithms, also called gradient-TD methods, are based on stochastic gradient-descent in a Bellman error objective function. We provide convergence proofs for general settings, including off-policy learning with unrestricted features, and nonlinear function approximation. Gradient-TD algorithms are on-line, incremental, and extend conventional TD methods to off-policy learning while retaining a convergence guarantee and only doubling computational requirements. Our empirical results suggest that many members of the gradient-TD algorithms may be slower than conventional TD on the subset of training cases in which conventional TD methods are sound. Our latest gradient-TD algorithms are “hybrid” in that they become equivalent to conventional TD—in terms of asymptotic rate of convergence—in on-policy problems.
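
The "stochastic gradient-descent in a Bellman error objective" can be made concrete with the TDC member of the family. The notation below is the standard linear-function-approximation form, an assumption here rather than a quotation from the thesis:

```latex
% TDC-style two-timescale updates (sketch; standard notation assumed).
% \phi_t are the features at time t and \delta_t is the TD error under
% the primary weights \theta_t:
\delta_t = r_{t+1} + \gamma\,\theta_t^\top \phi_{t+1} - \theta_t^\top \phi_t ,
\qquad
\theta_{t+1} = \theta_t
  + \alpha_t\bigl(\delta_t\,\phi_t - \gamma\,\phi_{t+1}\,(\phi_t^\top w_t)\bigr),
\qquad
w_{t+1} = w_t + \beta_t\bigl(\delta_t - \phi_t^\top w_t\bigr)\phi_t .
```

The auxiliary weight vector w, learned by the second update, is what the abstract's "only doubling computational requirements" refers to: it estimates the expected TD update so the correction term can cancel the off-policy bias that destabilizes plain TD.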

  • Research Article
  • Cited by 258
  • 10.1109/tnn.2007.899161
Kernel-Based Least Squares Policy Iteration for Reinforcement Learning
  • Jul 1, 2007
  • IEEE Transactions on Neural Networks
  • Xin Xu + 2 more

In this paper, we present a kernel-based least squares policy iteration (KLSPI) algorithm for reinforcement learning (RL) in large or continuous state spaces, which can be used to realize adaptive feedback control of uncertain dynamic systems. By using KLSPI, near-optimal control policies can be obtained without much a priori knowledge on dynamic models of control plants. In KLSPI, Mercer kernels are used in the policy evaluation of a policy iteration process, where a new kernel-based least squares temporal-difference algorithm called KLSTD-Q is proposed for efficient policy evaluation. To keep the sparsity and improve the generalization ability of KLSTD-Q solutions, a kernel sparsification procedure based on approximate linear dependency (ALD) is performed. Compared to the previous works on approximate RL methods, KLSPI makes two progresses to eliminate the main difficulties of existing results. One is the better convergence and (near) optimality guarantee by using the KLSTD-Q algorithm for policy evaluation with high precision. The other is the automatic feature selection using the ALD-based kernel sparsification. Therefore, the KLSPI algorithm provides a general RL method with generalization performance and convergence guarantee for large-scale Markov decision problems (MDPs). Experimental results on a typical RL task for a stochastic chain problem demonstrate that KLSPI can consistently achieve better learning efficiency and policy quality than the previous least squares policy iteration (LSPI) algorithm. Furthermore, the KLSPI method was also evaluated on two nonlinear feedback control problems, including a ship heading control problem and the swing up control of a double-link underactuated pendulum called acrobot. Simulation results illustrate that the proposed method can optimize controller performance using little a priori information of uncertain dynamic systems. 
It is also demonstrated that KLSPI can be applied to online learning control by incorporating an initial controller to ensure online performance.

  • Book Chapter
  • Cited by 13
  • 10.5772/5279
Decentralized Reinforcement Learning for the Online Optimization of Distributed Systems
  • Jan 1, 2008
  • Jim Dowling + 1 more

This research was supported by a Marie Curie Intra-European Fellowship within the 6th European Community Framework Programme. The authors would like to thank Jan Sacha for an implementation of CRL in Java on which the experiments in this paper are based.

  • Research Article
  • Cited by 32
  • 10.1609/aaai.v35i10.17062
Decentralized Policy Gradient Descent Ascent for Safe Multi-Agent Reinforcement Learning
  • May 18, 2021
  • Proceedings of the AAAI Conference on Artificial Intelligence
  • Songtao Lu + 4 more

This paper deals with distributed reinforcement learning problems with safety constraints. In particular, we consider that a team of agents cooperate in a shared environment, where each agent has its individual reward function and safety constraints that involve all agents' joint actions. As such, the agents aim to maximize the team-average long-term return, subject to all the safety constraints. More intriguingly, no central controller is assumed to coordinate the agents, and both the rewards and constraints are only known to each agent locally/privately. Instead, the agents are connected by a peer-to-peer communication network to share information with their neighbors. In this work, we first formulate this problem as a distributed constrained Markov decision process (D-CMDP) with networked agents. Then, we propose a decentralized policy gradient (PG) method, Safe Dec-PG, to perform policy optimization based on this D-CMDP model over a network. Convergence guarantees, together with numerical results, showcase the superiority of the proposed algorithm. To the best of our knowledge, this is the first decentralized PG algorithm that accounts for the coupled safety constraints with a quantifiable convergence rate in multi-agent reinforcement learning. Finally, we emphasize that our algorithm is also novel in solving a class of decentralized stochastic nonconvex-concave minimax optimization problems, where both the algorithm design and corresponding theoretical analysis are of independent interest.
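
One ingredient of such decentralized schemes, peer-to-peer mixing over the communication network, can be sketched in isolation. The ring topology and weight matrix below are assumptions for illustration, and the gradient-ascent-descent part of Safe Dec-PG is omitted:

```python
# Consensus mixing over a 4-agent ring (illustrative only).
# Each agent averages its parameter with its neighbors' using a
# doubly-stochastic matrix W; repeated mixing drives all agents to
# the network average while preserving that average exactly.
W = [[0.50, 0.25, 0.00, 0.25],
     [0.25, 0.50, 0.25, 0.00],
     [0.00, 0.25, 0.50, 0.25],
     [0.25, 0.00, 0.25, 0.50]]
theta = [1.0, 2.0, 3.0, 4.0]          # one scalar parameter per agent

for _ in range(100):                   # consensus-only steps, no gradients
    theta = [sum(W[i][j] * theta[j] for j in range(4)) for i in range(4)]

print(theta)  # every agent approaches the average (1+2+3+4)/4 = 2.5
```

In the full algorithm each mixing step is interleaved with local stochastic gradient steps on the agent's private objective, which is where the nonconvex-concave minimax analysis comes in.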

  • Research Article
  • Cited by 4
  • 10.1016/j.engappai.2023.106462
Integrated learning self-triggered control for model-free continuous-time systems with convergence guarantees
  • Jun 3, 2023
  • Engineering Applications of Artificial Intelligence
  • Haiying Wan + 4 more

  • Research Article
  • Cited by 184
  • 10.1007/s10479-005-5732-z
Basis Function Adaptation in Temporal Difference Reinforcement Learning
  • Feb 1, 2005
  • Annals of Operations Research
  • Ishai Menache + 2 more

Reinforcement Learning (RL) is an approach for solving complex multi-stage decision problems that fall under the general framework of Markov Decision Problems (MDPs), with possibly unknown parameters. Function approximation is essential for problems with a large state space, as it facilitates compact representation and enables generalization. Linear approximation architectures (where the adjustable parameters are the weights of pre-fixed basis functions) have recently gained prominence due to efficient algorithms and convergence guarantees. Nonetheless, an appropriate choice of basis function is important for the success of the algorithm. In the present paper we examine methods for adapting the basis function during the learning process in the context of evaluating the value function under a fixed control policy. Using the Bellman approximation error as an optimization criterion, we optimize the weights of the basis function while simultaneously adapting the (non-linear) basis function parameters. We present two algorithms for this problem. The first uses a gradient-based approach and the second applies the Cross Entropy method. The performance of the proposed algorithms is evaluated and compared in simulations.
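
For a fixed basis, the inner weight-optimization step against the Bellman approximation error reduces to least squares; the basis-parameter adaptation that is the paper's contribution sits on top of it. A scalar toy sketch, where the single basis function and the transition data are made up for illustration:

```python
# Bellman-residual least squares for fixed-basis linear weights (toy).
# Minimizes sum_t (r_t + gamma*w*phi(s2_t) - w*phi(s_t))^2 over scalar w,
# which has the closed form w = sum(d_t * r_t) / sum(d_t^2)
# with d_t = phi(s_t) - gamma * phi(s2_t).
GAMMA = 0.9
# transitions (s, r, s2) under the fixed policy (illustrative data)
DATA = [(0.0, 1.0, 1.0), (1.0, 1.0, 2.0), (2.0, 1.0, 3.0)]
phi = lambda s: s + 1.0          # single hypothetical basis function

d = [phi(s) - GAMMA * phi(s2) for s, _, s2 in DATA]
r = [rt for _, rt, _ in DATA]
w = sum(di * ri for di, ri in zip(d, r)) / sum(di * di for di in d)
print(w)  # the Bellman-residual minimizer for this data, -2.1/1.49
```

The paper's gradient-based and Cross Entropy algorithms additionally treat the parameters inside phi itself as decision variables, re-solving for the weights as the basis moves.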

  • Research Article
  • 10.1613/jair.1.13854
Low-Rank Representation of Reinforcement Learning Policies
  • Oct 27, 2022
  • Journal of Artificial Intelligence Research
  • Bogdan Mazoure + 6 more

We propose a general framework for policy representation for reinforcement learning tasks. This framework involves finding a low-dimensional embedding of the policy on a reproducing kernel Hilbert space (RKHS). The usage of RKHS based methods allows us to derive strong theoretical guarantees on the expected return of the reconstructed policy. Such guarantees are typically lacking in black-box models, but are very desirable in tasks requiring stability and convergence guarantees. We conduct several experiments on classic RL domains. The results confirm that the policies can be robustly represented in a low-dimensional space while the embedded policy incurs almost no decrease in returns.

  • Research Article
  • Cited by 38
  • 10.1109/jas.2014.7004680
Off-policy reinforcement learning with Gaussian processes
  • Jul 1, 2014
  • IEEE/CAA Journal of Automatica Sinica
  • Girish Chowdhary + 5 more

An off-policy Bayesian nonparametric approximate reinforcement learning framework, termed GPQ, that employs a Gaussian process (GP) model of the value (Q) function is presented in both the batch and online settings. Sufficient conditions on GP hyperparameter selection are established to guarantee convergence of off-policy GPQ in the batch setting, and theoretical and practical extensions are provided for the online case. Empirical results demonstrate that GPQ has competitive learning speed in addition to its convergence guarantees and its ability to automatically choose its own basis locations.

  • Research Article
  • Cited by 22
  • 10.1109/tits.2021.3091014
Using Reinforcement Learning to Control Traffic Signals in a Real-World Scenario: An Approach Based on Linear Function Approximation
  • Jul 1, 2022
  • IEEE Transactions on Intelligent Transportation Systems
  • Lucas N Alegre + 2 more

Reinforcement learning is an efficient, widely used machine learning technique that performs well in problems with a reasonable number of states and actions. This is rarely the case in control-related problems such as traffic signal control, where the state space can be very large. One way to deal with the curse of dimensionality is to use generalization techniques such as function approximation. In this paper, a linear function approximation is used by traffic signal agents in a network of signalized intersections. Specifically, a true online SARSA(λ) algorithm with Fourier basis functions (TOS(λ)-FB) is employed. This method has the advantage of convergence guarantees and error bounds, which non-linear function approximation lacks. In order to evaluate TOS(λ)-FB, we perform experiments on variations of an isolated-intersection scenario and on a scenario of the city of Cottbus, Germany, with 22 signalized intersections, implemented in MATSim. We compare our results not only to fixed-time controllers, but also to a state-of-the-art rule-based adaptive method, showing that TOS(λ)-FB performs far better than the fixed-time controllers while being at least as efficient as the rule-based approach. For more than half of the intersections, our approach leads to less congestion and delay, without needing the knowledge that underlies the rule-based approach.
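
The Fourier basis behind TOS(λ)-FB is easy to sketch: each feature is the cosine of an integer-weighted combination of the normalized state variables. The state dimension and order below are arbitrary assumptions for illustration:

```python
# Fourier basis features for linear function approximation (sketch).
# For a state x in [0, 1]^dim, feature i is cos(pi * c_i . x) where
# c_i ranges over all integer coefficient vectors in {0..order}^dim.
import itertools
import math

def fourier_basis(order, dim):
    coeffs = list(itertools.product(range(order + 1), repeat=dim))
    def phi(x):
        return [math.cos(math.pi * sum(c * xi for c, xi in zip(cvec, x)))
                for cvec in coeffs]
    return coeffs, phi

coeffs, phi = fourier_basis(order=2, dim=2)
print(len(coeffs))        # (order+1)**dim = 9 features
print(phi((0.0, 0.0)))    # at the origin every feature is cos(0) = 1.0
```

Because the approximation stays linear in the weights, the convergence and error-bound results for linear TD methods carry over, which is the advantage the abstract highlights over non-linear approximators.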
