On-Policy Algorithms for Continual Reinforcement Learning (Student Abstract)

Abstract

Continual reinforcement learning (CRL) studies optimal strategies for maximizing rewards in sequential environments that change over time. This is particularly crucial in domains such as robotics, where the operational environment is inherently dynamic and subject to continual change. Nevertheless, research in this area has thus far concentrated on off-policy algorithms whose replay buffers can amortize the impact of distribution shifts. Such an approach is not feasible with on-policy reinforcement learning algorithms, which learn solely from data collected by the current policy. In this paper, we examine the performance of proximal policy optimization (PPO), a prevalent on-policy reinforcement learning (RL) algorithm, on a classical CRL benchmark. Our findings suggest that current methods are suboptimal in terms of average performance; nevertheless, they achieve encouragingly competitive results on forward-transfer and forgetting metrics. This highlights the need for further research into continual on-policy reinforcement learning. The source code is available at https://github.com/Teddy298/continualworld-ppo.
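
To make the evaluation protocol concrete, the sketch below trains a single PPO agent sequentially on a stream of tasks and records per-task returns after every training phase, from which average performance and a simple forgetting measure follow. It is a minimal illustration only, assuming stable-baselines3 and Gymnasium; the gravity-varied Pendulum tasks and the metric definitions are placeholders, not the Continual World benchmark or the exact metrics used in the paper.

# Minimal sketch: PPO trained sequentially on a task stream, with per-task
# evaluation after each phase. Assumes stable-baselines3 and Gymnasium; the
# gravity-varied Pendulum tasks stand in for the real CRL benchmark.
import gymnasium as gym
import numpy as np
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

TASKS = [{"id": "Pendulum-v1", "g": 10.0}, {"id": "Pendulum-v1", "g": 14.0}]  # illustrative task stream
STEPS_PER_TASK = 50_000

def make_env(spec):
    return gym.make(spec["id"], g=spec["g"])

model = PPO("MlpPolicy", make_env(TASKS[0]), verbose=0)   # one network reused across all tasks
perf = np.zeros((len(TASKS), len(TASKS)))                  # perf[i, j]: return on task j after training task i

for i, task in enumerate(TASKS):
    model.set_env(make_env(task))
    model.learn(total_timesteps=STEPS_PER_TASK, reset_num_timesteps=False)
    for j, eval_task in enumerate(TASKS):
        mean_ret, _ = evaluate_policy(model, make_env(eval_task), n_eval_episodes=10)
        perf[i, j] = mean_ret

forgetting = perf.max(axis=0) - perf[-1]   # per task: best return ever seen minus final return
print("average final performance:", perf[-1].mean())
print("average forgetting:", forgetting.mean())

Forward transfer would be measured analogously, by comparing each task's learning curve against a reference agent trained on that task from scratch.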

Similar Papers
  • Research Article
  • Citations: 15
  • 10.1139/cjce-2017-0408
Continuous residual reinforcement learning for traffic signal control optimization
  • Jan 1, 2018
  • Canadian Journal of Civil Engineering
  • Mohammad Aslani + 2 more

Traffic signal control can be naturally regarded as a reinforcement learning problem. Unfortunately, it is one of the most difficult classes of reinforcement learning problems owing to its large state space. A straightforward approach to address this challenge is to control traffic signals based on continuous reinforcement learning. Although continuous reinforcement learning methods have been successful in traffic signal control, they may become unstable and fail to converge to near-optimal solutions. We develop adaptive traffic signal controllers based on continuous residual reinforcement learning (CRL-TSC) that are more stable. The effect of three feature functions is empirically investigated in a microscopic traffic simulation. Furthermore, the effects of departing streets, more actions, and the use of the spatial distribution of vehicles on the performance of CRL-TSCs are assessed. The results show that the best CRL-TSC setup reduces average travel time by 15% in comparison to an optimized fixed-time controller.
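
As background on the "residual" ingredient of the method above, the sketch below shows a residual-gradient TD(0) update with linear function approximation, which differentiates through both V(s) and V(s') for improved stability. The feature vectors, learning rate, and toy usage are illustrative assumptions, not the CRL-TSC controller from the paper.

# Sketch of a residual-gradient TD(0) update with linear value approximation.
# Purely illustrative; random features stand in for a traffic-state encoding.
import numpy as np

def residual_td_update(w, phi_s, phi_s_next, reward, gamma=0.95, alpha=0.01):
    """One residual-gradient update of the value weights w.

    Ordinary TD(0) follows the gradient of V(s) only; the residual gradient
    also differentiates through V(s'), trading learning speed for stability.
    """
    td_error = reward + gamma * (w @ phi_s_next) - (w @ phi_s)
    grad = td_error * (gamma * phi_s_next - phi_s)   # gradient of 0.5 * td_error**2 w.r.t. w
    return w - alpha * grad

# Toy usage with placeholder features.
rng = np.random.default_rng(0)
w = np.zeros(8)
for _ in range(100):
    phi_s, phi_s_next = rng.normal(size=8), rng.normal(size=8)
    w = residual_td_update(w, phi_s, phi_s_next, reward=rng.normal())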

  • Research Article
  • 10.3390/math13162542
Uncertainty-Aware Continual Reinforcement Learning via PPO with Graph Representation Learning
  • Aug 8, 2025
  • Mathematics
  • Dongjae Kim

Continual reinforcement learning (CRL) agents face significant challenges when encountering distributional shifts. This paper formalizes these shifts into two key scenarios, namely virtual drift (domain switches), where object semantics change (e.g., walls becoming lava), and concept drift (task switches), where the environment’s structure is reconfigured (e.g., moving from object navigation to a door key puzzle). This paper demonstrates that while conventional convolutional neural networks (CNNs) struggle to preserve relational knowledge during these transitions, graph convolutional networks (GCNs) can inherently mitigate catastrophic forgetting by encoding object interactions through explicit topological reasoning. A unified framework is proposed that integrates GCN-based state representation learning with a proximal policy optimization (PPO) agent. The GCN’s message-passing mechanism preserves invariant relational structures, which diminishes performance degradation during abrupt domain switches. Experiments conducted in procedurally generated MiniGrid environments show that the method significantly reduces catastrophic forgetting in domain switch scenarios. While showing comparable mean performance in task switch scenarios, our method demonstrates substantially lower performance variance (Levene’s test, p < 1.0×10⁻¹⁰), indicating superior learning stability compared to CNN-based methods. By bridging graph representation learning with robust policy optimization in CRL, this research advances the stability of decision-making in dynamic environments and establishes GCNs as a principled alternative to CNNs for applications requiring stable, continual learning.
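
A rough picture of the described architecture, assuming PyTorch: a small hand-rolled GCN encoder pools node embeddings into a state vector that feeds PPO-style policy and value heads. Layer sizes, the adjacency handling, and the discrete action count are illustrative assumptions, not the paper's exact model.

# Minimal PyTorch sketch: GCN state encoder feeding a PPO-style actor-critic head.
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        # x: (num_nodes, in_dim); adj: (num_nodes, num_nodes), assumed row-normalized.
        return torch.relu(self.linear(adj @ x))

class GCNActorCritic(nn.Module):
    def __init__(self, node_dim, hidden_dim, num_actions):
        super().__init__()
        self.gcn1 = GCNLayer(node_dim, hidden_dim)
        self.gcn2 = GCNLayer(hidden_dim, hidden_dim)
        self.policy_head = nn.Linear(hidden_dim, num_actions)  # logits for PPO's categorical policy
        self.value_head = nn.Linear(hidden_dim, 1)

    def forward(self, x, adj):
        h = self.gcn2(self.gcn1(x, adj), adj)
        state_embedding = h.mean(dim=0)            # mean-pool node embeddings into one state vector
        return self.policy_head(state_embedding), self.value_head(state_embedding)

# Toy usage: 5 objects in the scene, 16 features each, 7 discrete actions.
model = GCNActorCritic(node_dim=16, hidden_dim=64, num_actions=7)
x, adj = torch.randn(5, 16), torch.eye(5)
logits, value = model(x, adj)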

  • Research Article
  • 10.1371/journal.pone.0334219
Reinforcement learning for UAV flight controls: Evaluating continuous space reinforcement learning algorithms for fixed-wing UAVs
  • Oct 9, 2025
  • PLOS One
  • Hasan Raza Khanzada + 2 more

Flight controls are experiencing a major shift with the integration of reinforcement learning (RL). Recent studies have demonstrated the potential of RL to deliver robust and precise control across diverse applications, including the flight control of fixed-wing unmanned aerial vehicles (UAVs). However, a critical gap persists in the rigorous evaluation and comparative analysis of leading continuous-space RL algorithms. This paper aims to provide a comparative analysis of RL-driven flight control systems for fixed-wing UAVs in dynamic and uncertain environments. Five prominent RL algorithms, namely Deep Deterministic Policy Gradient (DDPG), Twin Delayed Deep Deterministic Policy Gradient (TD3), Proximal Policy Optimization (PPO), Trust Region Policy Optimization (TRPO), and Soft Actor-Critic (SAC), are evaluated to determine their suitability for complex UAV flight dynamics, while highlighting their relative strengths and limitations. All the RL agents are trained in the same high-fidelity simulation environment to control the pitch, roll, and heading of the UAV under varying flight conditions. The results demonstrate that the RL algorithms outperform classical PID controllers in terms of stability, responsiveness, and robustness, especially during environmental disturbances such as wind gusts. The comparative analysis reveals that the SAC algorithm achieves convergence in 400 episodes and maintains a steady-state error below 3%, offering the best trade-off among the evaluated RL algorithms. This analysis aims to provide valuable insights for the selection of suitable RL algorithms and their practical integration into modern UAV control systems.
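
A sketch of the comparison protocol, under the assumption that stable-baselines3 (plus sb3-contrib for TRPO) is used: each algorithm is trained on the same environment and scored by mean evaluation return. "Pendulum-v1" is a placeholder for the fixed-wing UAV simulator, which is not public here, and the timestep budget is arbitrary.

# Sketch: train five continuous-action agents on one environment and compare returns.
# Assumes stable-baselines3 / sb3-contrib; the environment id is a placeholder.
import gymnasium as gym
from stable_baselines3 import DDPG, PPO, SAC, TD3
from sb3_contrib import TRPO
from stable_baselines3.common.evaluation import evaluate_policy

ALGOS = {"DDPG": DDPG, "TD3": TD3, "PPO": PPO, "TRPO": TRPO, "SAC": SAC}
results = {}
for name, algo in ALGOS.items():
    model = algo("MlpPolicy", gym.make("Pendulum-v1"), verbose=0)
    model.learn(total_timesteps=100_000)
    mean_ret, std_ret = evaluate_policy(model, gym.make("Pendulum-v1"), n_eval_episodes=20)
    results[name] = (mean_ret, std_ret)

for name, (mean_ret, std_ret) in sorted(results.items(), key=lambda kv: -kv[1][0]):
    print(f"{name}: {mean_ret:.1f} +/- {std_ret:.1f}")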

  • Research Article
  • Citations: 16
  • 10.1007/s10489-020-01786-1
SLER: Self-generated long-term experience replay for continual reinforcement learning
  • Aug 7, 2020
  • Applied Intelligence
  • Chunmao Li + 4 more

Deep reinforcement learning has achieved significant success in various domains. However, it still faces a huge challenge when learning multiple tasks in sequence. This is because interaction in a complex setting involves continual learning, which results in changes in the data distribution over time. A continual learning system should ensure that the agent acquires new knowledge without forgetting previous knowledge. However, catastrophic forgetting may occur, as new experience can overwrite previous experience due to limited memory size. The dual experience replay algorithm, which retains previous experience, is widely applied to reduce forgetting, but it cannot be applied to scalable tasks when the memory size is constrained. To relax this memory-size constraint, we propose a new continual reinforcement learning algorithm called Self-generated Long-term Experience Replay (SLER). Our method differs from the standard dual experience replay algorithm, in which short-term experience replay retains the current task's experience and long-term experience replay retains all past tasks' experience to achieve continual learning. We first train an environment sample model, called Experience Replay Mode (ERM), to generate simulated state sequences of previous tasks for knowledge retention. We then combine the ERM with the experience of the new task to generate simulated experience of all previous tasks and thereby alleviate forgetting. Our method can effectively decrease the memory requirement in multi-task reinforcement learning. We show that our method performs better than a state-of-the-art deep learning method in the StarCraft II and GridWorld environments, and achieves results comparable to the dual experience replay method, which retains the experience of all tasks.
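
The replay scheme can be pictured as below: a short-term buffer holds the current task's real transitions, a learned generator (standing in for the ERM) supplies simulated transitions from past tasks, and training batches mix the two. The generator interface, buffer size, and mixing ratio are assumptions for illustration, not the authors' implementation.

# Schematic of generative long-term replay: mix real current-task transitions
# with transitions sampled from a learned model of past tasks.
import random

class ShortTermBuffer:
    def __init__(self, capacity=10_000):
        self.capacity, self.data = capacity, []

    def add(self, transition):
        self.data.append(transition)
        if len(self.data) > self.capacity:
            self.data.pop(0)

    def sample(self, n):
        return random.sample(self.data, min(n, len(self.data)))

class StubERM:
    """Placeholder for the learned generative model of past-task transitions."""
    def sample(self, n):
        return [("generated_state", "generated_action", 0.0, "generated_next_state")] * n

def mixed_batch(short_term, erm_generator, batch_size=64, past_fraction=0.5):
    """Combine real current-task transitions with generated past-task ones."""
    past = erm_generator.sample(int(batch_size * past_fraction))
    current = short_term.sample(batch_size - len(past))
    batch = past + current
    random.shuffle(batch)
    return batch

# Toy usage.
buffer = ShortTermBuffer()
for _ in range(200):
    buffer.add(("state", "action", 1.0, "next_state"))
batch = mixed_batch(buffer, StubERM())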

  • Research Article
  • Citations: 6
  • 10.1177/1729881420911491
Continuous reinforcement learning to adapt multi-objective optimization online for robot motion
  • Mar 1, 2020
  • International Journal of Advanced Robotic Systems
  • Kai Zhang + 3 more

This article introduces a continuous reinforcement learning framework to enable online adaptation of multi-objective optimization functions for guiding a mobile robot to move in changing dynamic environments. The robot with this framework can continuously learn from multiple or changing environments where it encounters different numbers of obstacles moving in unknown ways at different times. Using both planned trajectories from a real-time motion planner and already executed trajectories as feedback observations, our reinforcement learning agent enables the robot to adapt its motion behaviors to environmental changes. The agent contains a Q network connected to a long short-term memory network. The proposed framework is tested in both simulations and real robot experiments over various, dynamically varied task environments. The results show the efficacy of online continuous reinforcement learning for quick adaptation to different, unknown, and dynamic environments.
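
For concreteness, a minimal PyTorch sketch of the agent architecture mentioned above: a Q network fed by an LSTM that summarizes trajectory feedback. The dimensions, the discrete set of candidate objective weightings, and the input encoding are illustrative assumptions.

# Sketch of an LSTM-backed Q network over a small set of candidate actions.
import torch
import torch.nn as nn

class RecurrentQNet(nn.Module):
    def __init__(self, obs_dim, hidden_dim, num_actions):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden_dim, batch_first=True)
        self.q_head = nn.Linear(hidden_dim, num_actions)

    def forward(self, obs_seq):
        # obs_seq: (batch, time, obs_dim) trajectory feedback observations
        _, (h_n, _) = self.lstm(obs_seq)
        return self.q_head(h_n[-1])   # Q-values, one per candidate objective weighting

q_net = RecurrentQNet(obs_dim=12, hidden_dim=64, num_actions=5)
q_values = q_net(torch.randn(2, 20, 12))   # batch of 2 trajectories, 20 steps each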

  • Conference Article
  • Citations: 4
  • 10.1109/icdmw58026.2022.00011
Streaming Traffic Flow Prediction Based on Continuous Reinforcement Learning
  • Nov 1, 2022
  • Yanan Xiao + 5 more

Traffic flow prediction is an important part of smart transportation. The goal is to predict future traffic conditions based on historical data recorded by sensors and the traffic network. As the city continues to build, parts of the transportation network will be added or modified. Accurately predicting expanding and evolving long-term streaming networks is therefore of great significance. To this end, we propose a new simulation-based criterion that considers teaching autonomous agents to mimic sensor patterns, planning their next visit based on the sensor's profile (e.g., traffic, speed, occupancy). The data recorded by the sensor is most accurate when the agent can perfectly simulate the sensor's activity pattern. We propose to formulate the problem as a continuous reinforcement learning task, where the agent is the next flow value predictor, the action is the next time-series flow value in the sensor, and the environment state is a dynamically fused representation of the sensor and transportation network. Actions taken by the agent change the environment, which in turn forces the agent's model to update, while the agent further explores changes in the dynamic traffic network, which helps the agent predict its next visit more accurately. Therefore, we develop a strategy in which sensors and traffic networks update each other and incorporate temporal context to quantify state representations evolving over time. Along these lines, we propose streaming traffic flow prediction based on a continuous reinforcement learning model (ST-CRL), a predictive model based on reinforcement learning and continuous learning, together with an analytical algorithm based on KL divergence that incorporates long-term novel patterns into model induction. Second, we introduce a prioritized experience replay strategy to consolidate and aggregate previously learned core knowledge into the model. The proposed model is able to continuously learn and predict as the traffic flow network expands and evolves over time. Extensive experiments show that the algorithm has great potential in predicting long-term streaming networks, while achieving data privacy protection to a certain extent.
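
The KL-divergence component can be illustrated with a simple drift check between historical and recent sensor readings, as sketched below; the histogram binning and threshold are illustrative assumptions, and this is not the ST-CRL algorithm itself.

# Sketch: flag a "novel pattern" when recent flow readings diverge from history.
import numpy as np

def kl_divergence(p, q, eps=1e-8):
    p, q = np.asarray(p, dtype=float) + eps, np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def is_novel_pattern(reference_flows, recent_flows, bins=20, threshold=0.5):
    """Detect a distribution shift between historical and recent sensor readings."""
    lo = min(reference_flows.min(), recent_flows.min())
    hi = max(reference_flows.max(), recent_flows.max())
    p, _ = np.histogram(reference_flows, bins=bins, range=(lo, hi))
    q, _ = np.histogram(recent_flows, bins=bins, range=(lo, hi))
    return kl_divergence(q, p) > threshold

rng = np.random.default_rng(0)
print(is_novel_pattern(rng.normal(100, 10, 5000), rng.normal(140, 10, 500)))   # True: flows shifted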

  • Research Article
  • Citations: 5
  • 10.1007/s10994-022-06283-9
Hierarchically structured task-agnostic continual learning
  • Dec 28, 2022
  • Machine Learning
  • Heinke Hihn + 1 more

One notable weakness of current machine learning algorithms is the poor ability of models to solve new problems without forgetting previously acquired knowledge. The Continual Learning paradigm has emerged as a protocol to systematically investigate settings where the model sequentially observes samples generated by a series of tasks. In this work, we take a task-agnostic view of continual learning and develop a hierarchical information-theoretic optimality principle that facilitates a trade-off between learning and forgetting. We derive this principle from a Bayesian perspective and show its connections to previous approaches to continual learning. Based on this principle, we propose a neural network layer, called the Mixture-of-Variational-Experts layer, that alleviates forgetting by creating a set of information processing paths through the network which is governed by a gating policy. Equipped with a diverse and specialized set of parameters, each path can be regarded as a distinct sub-network that learns to solve tasks. To improve expert allocation, we introduce diversity objectives, which we evaluate in additional ablation studies. Importantly, our approach can operate in a task-agnostic way, i.e., it does not require task-specific knowledge, as is the case with many existing continual learning algorithms. Due to the general formulation based on generic utility functions, we can apply this optimality principle to a large variety of learning problems, including supervised learning, reinforcement learning, and generative modeling. We demonstrate the competitive performance of our method on continual reinforcement learning and variants of the MNIST, CIFAR-10, and CIFAR-100 datasets.
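
A minimal PyTorch sketch of a gated mixture-of-experts layer in the spirit of the Mixture-of-Variational-Experts layer described above (omitting the variational and information-theoretic machinery); the expert count and layer sizes are illustrative assumptions.

# Sketch: a set of expert sub-networks combined by a learned gating policy.
import torch
import torch.nn as nn

class MixtureOfExpertsLayer(nn.Module):
    def __init__(self, in_dim, out_dim, num_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(in_dim, out_dim) for _ in range(num_experts))
        self.gate = nn.Linear(in_dim, num_experts)

    def forward(self, x):
        weights = torch.softmax(self.gate(x), dim=-1)                    # (batch, num_experts)
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)    # (batch, num_experts, out_dim)
        return (weights.unsqueeze(-1) * expert_out).sum(dim=1)           # gated combination of expert paths

layer = MixtureOfExpertsLayer(in_dim=32, out_dim=16)
y = layer(torch.randn(8, 32))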

  • Conference Article
  • Citations: 17
  • 10.1109/glocom.2014.7037498
Adaptive proportional fair parameterization based LTE scheduling using continuous actor-critic reinforcement learning
  • Dec 1, 2014
  • Ioan Sorin Comsa + 5 more

Maintaining a desired trade-off between system throughput maximization and user fairness satisfaction constitutes a problem that is still far from being solved. In LTE systems, different trade-off levels can be obtained by using a proper parameterization of the Generalized Proportional Fair (GPF) scheduling rule. Our approach is able to find the best parameterization policy that maximizes the system throughput under different fairness constraints imposed by the scheduler state. The proposed method adapts and refines the policy at each Transmission Time Interval (TTI) by using a Multi-Layer Perceptron Neural Network (MLPNN) as a non-linear function approximation between the continuous scheduler state and the optimal GPF parameter(s). The MLPNN is trained based on Continuous Actor-Critic Learning Automata Reinforcement Learning (CACLA RL). The double GPF parameterization optimization problem is addressed by using CACLA RL with two continuous actions (CACLA-2). Five reinforcement learning algorithms based on simple parameterization techniques are compared against the proposed approach. Simulation results indicate that CACLA-2 performs much better than any of the other candidates that adjust only one scheduling parameter, such as CACLA-1. CACLA-2 outperforms CACLA-1 by reducing the percentage of TTIs during which the system is considered unfair. Being able to attenuate the fluctuations of the obtained policy, CACLA-2 achieves enhanced throughput gain when severe changes occur in the scheduling environment, while at the same time maintaining the fairness optimality condition.
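
The CACLA update that the scheduler builds on is simple to state: the critic is trained on the TD error, and the actor is moved toward the executed action only when that action turned out better than expected (positive TD error). The PyTorch sketch below illustrates one such update; the network sizes, state encoding, and two-dimensional action are illustrative assumptions, not the LTE scheduler itself.

# Sketch of a single CACLA-style actor-critic update with two continuous actions.
import torch
import torch.nn as nn

actor = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))   # two continuous parameters
critic = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 1))
opt_actor = torch.optim.SGD(actor.parameters(), lr=1e-3)
opt_critic = torch.optim.SGD(critic.parameters(), lr=1e-3)
gamma = 0.95

def cacla_step(state, action_taken, reward, next_state):
    td_target = reward + gamma * critic(next_state).detach()
    td_error = td_target - critic(state)

    opt_critic.zero_grad()
    td_error.pow(2).mean().backward()        # critic regression toward the TD target
    opt_critic.step()

    if td_error.item() > 0:                  # only reinforce actions that beat the critic's estimate
        opt_actor.zero_grad()
        (actor(state) - action_taken).pow(2).mean().backward()
        opt_actor.step()

cacla_step(torch.randn(4), torch.randn(2), reward=torch.tensor(1.0), next_state=torch.randn(4))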

  • Research Article
  • Citations: 33
  • 10.1016/j.artmed.2021.102227
Continuous action deep reinforcement learning for propofol dosing during general anesthesia
  • Dec 2, 2021
  • Artificial Intelligence in Medicine
  • Gabriel Schamberg + 4 more

Purpose: Anesthesiologists simultaneously manage several aspects of patient care during general anesthesia. Automating administration of hypnotic agents could enable more precise control of a patient's level of unconsciousness and enable anesthesiologists to focus on the most critical aspects of patient care. Reinforcement learning (RL) algorithms can be used to fit a mapping from patient state to a medication regimen. These algorithms can learn complex control policies that, when paired with modern techniques for promoting model interpretability, offer a promising approach for developing a clinically viable system for automated anesthetic drug delivery. Methods: We expand on our prior work applying deep RL to automated anesthetic dosing by now using a continuous-action model based on the actor-critic RL paradigm. The proposed RL agent is composed of a policy network that maps observed anesthetic states to a continuous probability density over propofol-infusion rates and a value network that estimates the favorability of observed states. We train and test three versions of the RL agent using varied reward functions. The agent is trained using simulated pharmacokinetic/pharmacodynamic models with randomized parameters to ensure robustness to patient variability. The model is tested on simulations and retrospectively on nine general anesthesia cases collected in the operating room. We utilize Shapley additive explanations to gain an understanding of the factors with the greatest influence over the agent's decision-making. Results: The deep RL agent significantly outperformed a proportional-integral-derivative controller (median episode median absolute performance error 1.9% ± 1.8 versus 3.1% ± 1.1). The model that was rewarded for minimizing total doses performed the best across simulated patient demographics (median episode median performance error 1.1% ± 0.5). When run on real-world clinical datasets, the agent recommended doses that were consistent with those administered by the anesthesiologist. Conclusions: The proposed approach marks the first fully continuous deep RL algorithm for automating anesthetic drug dosing. The reward function used by the RL training algorithm can be flexibly designed to encourage desirable practices (e.g., using less anesthetic) and bolster performance. Through careful analysis of the learned policies, techniques for interpreting dosing decisions, and testing on clinical data, we confirm that the agent's anesthetic dosing is consistent with our understanding of best practices in anesthesia care.
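
The continuous-action agent described above can be pictured as a network producing a probability density over infusion rates plus a value network, as in the PyTorch sketch below; the state features, units, and layer sizes are illustrative assumptions, not the clinical model.

# Sketch: Gaussian policy over a continuous infusion rate plus a value network.
import torch
import torch.nn as nn

class DosingPolicy(nn.Module):
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh())
        self.mean = nn.Linear(hidden, 1)                 # mean infusion rate (arbitrary units)
        self.log_std = nn.Parameter(torch.zeros(1))
        self.value = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, state):
        h = self.body(state)
        dist = torch.distributions.Normal(self.mean(h), self.log_std.exp())
        return dist, self.value(state)

policy = DosingPolicy(state_dim=6)
dist, v = policy(torch.randn(6))
dose = dist.sample().clamp(min=0.0)   # infusion rates cannot be negative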

  • Research Article
  • Citations: 42
  • 10.1007/s10514-022-10034-z
Continuous control actions learning and adaptation for robotic manipulation through reinforcement learning
  • Feb 9, 2022
  • Autonomous Robots
  • Asad Ali Shahid + 3 more

This paper presents a learning-based method that uses simulation data to learn an object manipulation task with two model-free reinforcement learning (RL) algorithms. The learning performance is compared across an on-policy and an off-policy algorithm: Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC). In order to accelerate the learning process, a fine-tuning procedure is proposed that demonstrates the continuous adaptation of on-policy RL to new environments, allowing the learned policy to adapt to and execute the (partially) modified task. A dense reward function is designed for the task to enable efficient learning by the agent. A grasping task involving a Franka Emika Panda manipulator is considered as the reference task to be learned. The learned control policy is demonstrated to be generalizable across multiple object geometries and initial robot/parts configurations. The approach is finally tested on a real Franka Emika Panda robot, showing that the learned behavior can be transferred from simulation. Experimental results show a 100% grasping success rate, making the proposed approach applicable to real applications.

  • Research Article
  • 10.3390/en17235876
Exploring the Preference for Discrete over Continuous Reinforcement Learning in Energy Storage Arbitrage
  • Nov 22, 2024
  • Energies
  • Jaeik Jeong + 2 more

In recent research addressing energy arbitrage with energy storage systems (ESSs), discrete reinforcement learning (RL) has often been employed, while the underlying reasons for this preference have not been explicitly clarified. This paper aims to elucidate why discrete RL tends to be more suitable than continuous RL for energy arbitrage problems. When using continuous RL, the charging and discharging actions determined by the agent often exceed the physical limits of the ESS, necessitating clipping to the boundary values. This introduces a critical issue where the learned actions become stuck at the state of charge (SoC) boundaries, hindering effective learning. Although recent advancements in constrained RL offer potential solutions, their application often results in overly conservative policies, preventing the full utilization of ESS capabilities. In contrast, discrete RL, while lacking in granular control, successfully avoids these two key challenges, as demonstrated by simulation results showing superior performance. Additionally, it was found that, due to its characteristics, discrete RL more easily drives the ESS towards fully charged or fully discharged states, thereby increasing the utilization of the storage system. Our findings provide a solid justification for the prevalent use of discrete RL in recent studies involving energy arbitrage with ESSs, offering new insights into the strategic selection of RL methods in this domain. Looking ahead, improving performance will require further advancements in continuous RL methods. This study provides valuable direction for future research in continuous RL, highlighting the challenges and potential strategies to overcome them to fully exploit ESS capabilities.
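
The clipping issue can be seen in a toy example: any continuous action beyond the ESS power limit collapses to the same boundary action after clipping, whereas a discrete action set is feasible by construction. The numbers below are illustrative assumptions only.

# Toy illustration: distinct continuous policy outputs collapse to one clipped action.
import numpy as np

P_MAX = 0.25                                   # max charge/discharge per step (fraction of capacity)

def effective_action(raw_action):
    return float(np.clip(raw_action, -P_MAX, P_MAX))

for raw in [0.3, 0.8, 2.0]:                    # three different un-clipped network outputs...
    print(raw, "->", effective_action(raw))    # ...all collapse to the +0.25 boundary action

discrete_actions = np.linspace(-P_MAX, P_MAX, 5)   # e.g. {-0.25, -0.125, 0.0, 0.125, 0.25}
print("discrete action set:", discrete_actions)    # every option is already feasible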

  • Research Article
  • Citations: 5
  • 10.3390/pr7080546
Multivariable System Identification Method Based on Continuous Action Reinforcement Learning Automata
  • Aug 17, 2019
  • Processes
  • Meiying Jiang + 1 more

In this work, a closed-loop identification method based on a reinforcement learning algorithm is proposed for multiple-input multiple-output (MIMO) systems. This method offers an attractive alternative to current frequency-domain identification algorithms, which are usually dependent on the attenuation factor. With this method, after continuously interacting with the environment, the optimal attenuation factor can be identified by continuous action reinforcement learning automata (CARLA), and the corresponding parameters can then be estimated. Moreover, the proposed method can be applied to time-varying systems online due to its online learning ability. The simulation results suggest that the presented approach meets the identification-accuracy requirements for both square and non-square systems.

  • Conference Article
  • Citations: 1
  • 10.1109/cac53003.2021.9727354
Mapless navigation based on continuous deep reinforcement learning
  • Oct 22, 2021
  • Xing Chen + 2 more

This paper proposes a mapless navigation scheme based on continuous deep reinforcement learning to address the problem that robots cannot flexibly avoid obstacles and navigate in a dynamic environment. The reinforcement learning algorithm used in this article is proximal policy optimization (PPO), and the benchmark is the discrete deep reinforcement learning algorithm Deep Q-Network (DQN). Experiments in the Gazebo simulation environment show that the training efficiency and success rate of the PPO algorithm are much higher than those of the DQN algorithm. The policy model trained in the simulation environment is then transferred directly to a physical robot. The experimental results verify that the physical robot achieves good navigation and obstacle-avoidance capabilities without retraining. The tested single-target navigation success rate is 80%, and the multi-target navigation success rate is 70%.

  • Conference Article
  • Citations: 7
  • 10.1145/3319619.3322044
Towards continual reinforcement learning through evolutionary meta-learning
  • Jul 13, 2019
  • Djordje Grbic + 1 more

In continual learning, an agent is exposed to a changing environment, requiring it to adapt during execution time. While traditional reinforcement learning (RL) methods have shown impressive results in various domains, there has been less progress in addressing the challenge of continual learning. Current RL approaches do not allow the agent to adapt during execution but only during a dedicated training phase. Here we study the problem of continual learning in a 2D bipedal walker domain, in which the legs of the walker grow over its lifetime, requiring the agent to adapt. The introduced approach combines neuroevolution, to determine the starting weights of a deep neural network, and a version of deep reinforcement learning that is continually running during execution time. The proof-of-concept results show that the combined approach gives a better generalisation performance when compared to evolution or reinforcement learning alone. The hybridization of reinforcement learning and evolution opens up exciting new research directions for continually learning agents that can benefit from suitable priors determined by an evolutionary process.
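
The two-stage idea can be sketched as a simple evolution strategy that searches for good starting weights, which are then handed to a gradient-based RL learner that keeps adapting during the agent's lifetime. The fitness function below is a stand-in for episodic returns in the walker domain, and the population size and policy dimensionality are illustrative assumptions.

# Sketch: evolve initial policy weights, then hand them to an online RL learner.
import numpy as np

def evaluate(weights):
    """Placeholder fitness: would be the episodic return of a policy built from `weights`."""
    return -np.sum((weights - 0.5) ** 2)

def evolve_initial_weights(dim=32, pop_size=20, generations=50, sigma=0.1, rng=None):
    rng = rng or np.random.default_rng(0)
    mean = np.zeros(dim)
    for _ in range(generations):
        population = mean + sigma * rng.normal(size=(pop_size, dim))
        fitness = np.array([evaluate(ind) for ind in population])
        elite = population[np.argsort(fitness)[-pop_size // 4:]]   # keep the top quarter
        mean = elite.mean(axis=0)
    return mean

initial_weights = evolve_initial_weights()
# initial_weights would now seed the deep RL policy that continues learning during execution.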

  • Book Chapter
  • 10.1007/978-3-031-06427-2_44
Avalanche RL: A Continual Reinforcement Learning Library
  • Jan 1, 2022
  • Nicoló Lucchesi + 3 more

Continual Reinforcement Learning (CRL) is a challenging setting where an agent learns to interact with an environment that is constantly changing over time (the stream of experiences). In this paper, we describe Avalanche RL, a library for Continual Reinforcement Learning which allows users to easily train agents on a continuous stream of tasks. Avalanche RL is based on PyTorch [23] and supports any OpenAI Gym [4] environment. Its design is based on Avalanche [16], one of the most popular continual learning libraries, which allows us to reuse a large number of continual learning strategies and improve the interaction between reinforcement learning and continual learning researchers. Additionally, we propose Continual Habitat-Lab, a novel benchmark and a high-level library which enables the usage of the photorealistic simulator Habitat-Sim [28] for CRL research. Overall, Avalanche RL attempts to unify continual reinforcement learning applications under a common framework, which we hope will foster the growth of the field. Keywords: Continual learning, Reinforcement learning, Reproducibility

More from: Proceedings of the AAAI Conference on Artificial Intelligence
  • Research Article
  • 10.1609/aaai.v39i14.33631
An Automatic Sound and Complete Abstraction Method for Generalized Planning with Baggable Types
  • Apr 11, 2025
  • Proceedings of the AAAI Conference on Artificial Intelligence
  • Hao Dong + 3 more

  • Research Article
  • 10.1609/aaai.v39i5.32551
Progressive Distribution Matching for Federated Semi-Supervised Learning
  • Apr 11, 2025
  • Proceedings of the AAAI Conference on Artificial Intelligence
  • Dongping Liao + 3 more

  • Open Access
  • Research Article
  • 10.1609/aaai.v39i27.35089
Trustworthy AI Meets Educational Assessment: Challenges and Opportunities
  • Apr 11, 2025
  • Proceedings of the AAAI Conference on Artificial Intelligence
  • Sheng Li

  • Research Article
  • 10.1609/aaai.v39i9.33067
World Knowledge-Enhanced Reasoning Using Instruction-Guided Interactor in Autonomous Driving
  • Apr 11, 2025
  • Proceedings of the AAAI Conference on Artificial Intelligence
  • Mingliang Zhai + 9 more

  • Research Article
  • 10.1609/aaai.v39i28.35280
Audience Engagement with Political Messaging on YouTube Shorts (Student Abstract)
  • Apr 11, 2025
  • Proceedings of the AAAI Conference on Artificial Intelligence
  • Omkar Narkar + 2 more

  • Open Access
  • Research Article
  • 10.1609/aaai.v39i23.34664
Automated Creation of Reusable and Diverse Toolsets for Enhancing LLM Reasoning
  • Apr 11, 2025
  • Proceedings of the AAAI Conference on Artificial Intelligence
  • Zhiyuan Ma + 5 more

  • Open Access
  • Research Article
  • 10.1609/aaai.v39i21.34421
HVAdam: A Full-Dimension Adaptive Optimizer
  • Apr 11, 2025
  • Proceedings of the AAAI Conference on Artificial Intelligence
  • Yiheng Zhang + 6 more

  • Open Access
  • Research Article
  • 10.1609/aaai.v39i7.32789
3D²-Actor: Learning Pose-Conditioned 3D-Aware Denoiser for Realistic Gaussian Avatar Modeling
  • Apr 11, 2025
  • Proceedings of the AAAI Conference on Artificial Intelligence
  • Zichen Tang + 4 more

  • Open Access
  • Research Article
  • 10.1609/aaai.v39i10.33150
Seg2Box: 3D Object Detection by Point-Wise Semantics Supervision
  • Apr 11, 2025
  • Proceedings of the AAAI Conference on Artificial Intelligence
  • Maoji Zheng + 5 more

  • Research Article
  • 10.1609/aaai.v39i28.35277
LoRA Unlearns More and Retains More (Student Abstract)
  • Apr 11, 2025
  • Proceedings of the AAAI Conference on Artificial Intelligence
  • Atharv Mittal
