Articles published on Continuous Reinforcement Learning
- Research Article
- 10.1371/journal.pone.0334219
- Oct 9, 2025
- PLOS One
- Hasan Raza Khanzada + 2 more
Flight controls are experiencing a major shift with the integration of reinforcement learning (RL). Recent studies have demonstrated the potential of RL to deliver robust and precise control across diverse applications, including the flight control of fixed-wing unmanned aerial vehicles (UAVs). However, a critical gap persists in the rigorous evaluation and comparative analysis of leading continuous-space RL algorithms. This paper provides a comparative analysis of RL-driven flight control systems for fixed-wing UAVs in dynamic and uncertain environments. Five prominent RL algorithms, namely Deep Deterministic Policy Gradient (DDPG), Twin Delayed Deep Deterministic Policy Gradient (TD3), Proximal Policy Optimization (PPO), Trust Region Policy Optimization (TRPO) and Soft Actor-Critic (SAC), are evaluated to determine their suitability for complex UAV flight dynamics, while highlighting their relative strengths and limitations. All the RL agents are trained in the same high-fidelity simulation environment to control the pitch, roll and heading of the UAV under varying flight conditions. The results demonstrate that the RL algorithms outperform classical PID controllers in terms of stability, responsiveness and robustness, especially under environmental disturbances such as wind gusts. The comparative analysis reveals that the SAC algorithm achieves convergence in 400 episodes and maintains a steady-state error below 3%, offering the best trade-off among the evaluated algorithms. This analysis aims to provide valuable insights for the selection of suitable RL algorithms and their practical integration into modern UAV control systems.
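As a rough illustration of how such a head-to-head comparison of continuous-action agents is typically set up, here is a minimal sketch assuming Stable-Baselines3 and a generic Gymnasium task as a stand-in for the paper's high-fidelity UAV simulator (TRPO is available separately in sb3-contrib); it is not the authors' actual experimental code.

```python
# Hypothetical comparison harness: each agent is trained on the same
# continuous-control task and evaluated on mean episodic return.
import gymnasium as gym
from stable_baselines3 import DDPG, PPO, SAC, TD3
from stable_baselines3.common.evaluation import evaluate_policy
# from sb3_contrib import TRPO  # TRPO lives in the contrib package

ALGOS = {"DDPG": DDPG, "TD3": TD3, "PPO": PPO, "SAC": SAC}

def compare(env_id: str = "Pendulum-v1", timesteps: int = 50_000) -> dict:
    results = {}
    for name, algo in ALGOS.items():
        env = gym.make(env_id)
        model = algo("MlpPolicy", env, verbose=0)  # identical policy class for a fair comparison
        model.learn(total_timesteps=timesteps)
        mean_ret, std_ret = evaluate_policy(model, env, n_eval_episodes=20)
        results[name] = (mean_ret, std_ret)
        env.close()
    return results

if __name__ == "__main__":
    for name, (mean_ret, std_ret) in compare().items():
        print(f"{name}: {mean_ret:.1f} +/- {std_ret:.1f}")
```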
- Research Article
- 10.1177/09596518251350353
- Sep 3, 2025
- Proceedings of the Institution of Mechanical Engineers, Part I: Journal of Systems and Control Engineering
- Anselmo Parnada + 4 more
Reinforcement Learning (RL) has been considered a promising method to enable the automation of contact-rich manipulation tasks, which can increase capabilities for industrial automation. RL facilitates autonomous agents’ learning to solve environments with complex dynamics with little human intervention, making it easier to implement control strategies for contact-rich tasks compared to traditional control approaches. Further, RL-based robotic control has the potential to transfer policies between task variations, significantly improving scalability compared to existing methods. However, RL is not yet viable for wider adoption due to its relatively high implementation costs and safety issues, so current research has been focused on addressing these issues. This paper comprehensively reviewed recently developed techniques to improve cost and safety for RL in contact-rich robotic manipulation. Techniques were organized by their approach, and their impact was analysed. It was found that current research efforts have significantly improved the cost and safety of RL-based control for contact-rich tasks, but further improvements can be made by progressing research towards improving knowledge transfer between tasks, improving inter-robot policy transfer and facilitating real-world and continual RL. The identified directions for further research set the stage for more versatile and cost-effective RL-based control for contact-rich robotic manipulation in future industrial automation applications.
- Research Article
- 10.3724/2096-7004.di.2025.0110
- Sep 1, 2025
- Data Intelligence
- Zeming Yang + 4 more
Enhancing Grammar Error Correction with Continuous Knowledge Distillation and Reinforcement Learning
- Research Article
- 10.1016/j.egyai.2025.100541
- Sep 1, 2025
- Energy and AI
- Sarvar Hussain Nengroo + 4 more
Continuous variable quantum reinforcement learning for HVAC control and power management in residential building
- Research Article
- 10.3390/math13162542
- Aug 8, 2025
- Mathematics
- Dongjae Kim
Continual reinforcement learning (CRL) agents face significant challenges when encountering distributional shifts. This paper formalizes these shifts into two key scenarios, namely virtual drift (domain switches), where object semantics change (e.g., walls becoming lava), and concept drift (task switches), where the environment’s structure is reconfigured (e.g., moving from object navigation to a door key puzzle). This paper demonstrates that while conventional convolutional neural networks (CNNs) struggle to preserve relational knowledge during these transitions, graph convolutional networks (GCNs) can inherently mitigate catastrophic forgetting by encoding object interactions through explicit topological reasoning. A unified framework is proposed that integrates GCN-based state representation learning with a proximal policy optimization (PPO) agent. The GCN’s message-passing mechanism preserves invariant relational structures, which diminishes performance degradation during abrupt domain switches. Experiments conducted in procedurally generated MiniGrid environments show that the method significantly reduces catastrophic forgetting in domain switch scenarios. While showing comparable mean performance in task switch scenarios, our method demonstrates substantially lower performance variance (Levene’s test, p < 1.0 × 10⁻¹⁰), indicating superior learning stability compared to CNN-based methods. By bridging graph representation learning with robust policy optimization in CRL, this research advances the stability of decision-making in dynamic environments and establishes GCNs as a principled alternative to CNNs for applications requiring stable, continual learning.
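To make the GCN-plus-PPO idea concrete, here is a minimal sketch of a graph-based state encoder whose pooled embedding would feed PPO's actor and critic heads. It assumes an object-centric graph over grid entities and uses plain PyTorch message passing; the authors' exact architecture may differ.

```python
# Illustrative GCN state encoder: each grid object is a node, edges encode
# spatial/relational adjacency, and the pooled embedding replaces CNN features.
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One round of normalized neighbourhood message passing: H' = ReLU(Â H W)."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # adj: (N, N) adjacency with self-loops; row-normalize before mixing.
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1.0)
        return torch.relu(self.linear((adj / deg) @ x))

class GCNStateEncoder(nn.Module):
    def __init__(self, node_dim: int, hidden: int = 64, embed: int = 128):
        super().__init__()
        self.gcn1 = GCNLayer(node_dim, hidden)
        self.gcn2 = GCNLayer(hidden, embed)

    def forward(self, node_feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        h = self.gcn2(self.gcn1(node_feats, adj), adj)
        return h.mean(dim=0)  # permutation-invariant graph embedding

# The embedding would then be passed to PPO's policy and value heads,
# e.g. actor = nn.Linear(128, n_actions), critic = nn.Linear(128, 1).
```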
- Research Article
- 10.3390/s25164895
- Aug 8, 2025
- Sensors (Basel, Switzerland)
- Yanhui Liu + 3 more
The extensive deployment of quadrotors in complex environmental missions has revealed a critical challenge: degradation of trajectory tracking accuracy due to time-varying wind disturbances. Conventional model-based controllers struggle to adapt to nonlinear wind field dynamics, while data-driven approaches often suffer from catastrophic forgetting that compromises environmental adaptability. This paper proposes a reinforcement learning framework with continual adaptation capabilities to enhance robust tracking performance for quadrotors operating in dynamic wind fields. We develop a continual reinforcement learning framework integrating continual backpropagation algorithms with reinforcement learning. Initially, a foundation model is trained in wind-free conditions. When the wind disturbance intensity undergoes gradual variations, a neuron utility assessment mechanism dynamically resets inefficient neurons to maintain network plasticity. Concurrently, a multi-objective reward function is designed to improve both training precision and efficiency. The framework was validated on the Gazebo/PX4 simulation platform under both stepwise growth and stochastic variations of the wind disturbance, demonstrating a reduction in the root mean square error of trajectory tracking compared to the standard PPO algorithm. The proposed framework resolves the plasticity loss problem in deep reinforcement learning through structured neuron resetting, significantly enhancing the continual adaptation capabilities of quadrotors in dynamic wind fields.
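The neuron-utility reset at the heart of continual backpropagation can be sketched as follows. This is a hedged illustration: the utility measure, reset fraction and re-initialization scheme here are assumptions, not the paper's exact formulation.

```python
# Sketch of a neuron-utility reset in the spirit of continual backpropagation:
# low-utility hidden units of the policy network are re-initialized so the
# network keeps plasticity as the wind field drifts.
import torch
import torch.nn as nn

def reset_low_utility_units(layer_in: nn.Linear, layer_out: nn.Linear,
                            activations: torch.Tensor, reset_fraction: float = 0.02):
    """Re-initialize the hidden units (rows of layer_in / columns of layer_out)
    whose contribution to the next layer is smallest."""
    with torch.no_grad():
        # Utility: mean |activation| scaled by the magnitude of outgoing weights.
        utility = activations.abs().mean(dim=0) * layer_out.weight.abs().sum(dim=0)
        n_reset = max(1, int(reset_fraction * utility.numel()))
        idx = torch.topk(utility, n_reset, largest=False).indices
        # Fresh incoming weights, zeroed outgoing weights (so the reset unit is inert at first).
        new_in = torch.empty(n_reset, layer_in.in_features)
        nn.init.kaiming_uniform_(new_in)
        layer_in.weight[idx] = new_in
        layer_in.bias[idx] = 0.0
        layer_out.weight[:, idx] = 0.0
```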
- Research Article
- 10.1109/mcom.001.2400526
- Aug 1, 2025
- IEEE Communications Magazine
- Masoud Shokrnezhad + 1 more
An Autonomous Network Orchestration Framework Integrating Large Language Models with Continual Reinforcement Learning
- Research Article
- 10.1016/j.engappai.2025.110676
- Jul 1, 2025
- Engineering Applications of Artificial Intelligence
- Jiawei Lin + 7 more
Continuous reinforcement learning via advantage value difference reward shaping: A proximal policy optimization perspective
- Research Article
- 10.3390/telecom6030046
- Jul 1, 2025
- Telecom
- Mays A Mawlood + 1 more
The rapid growth of high-quality telecommunications demands enhanced queueing system performance. Traditional bandwidth distribution often struggles to adapt to dynamic changes, network conditions, and erratic traffic patterns. Internet traffic fluctuates over time, causing resource underutilization. To address these challenges, this paper proposes a new adaptive algorithm called Weighted Fair Queues continual Deep Reinforcement Learning (WFQ continual-DRL), which integrates the advanced deep reinforcement learning Soft Actor-Critic (SAC) algorithm with the Elastic Weight Consolidation (EWC) approach. This technique is designed to overcome catastrophic forgetting in neural networks, thereby enhancing dynamic bandwidth allocation in network routers. The agent is trained to allocate bandwidth weights for multiple queues dynamically by interacting with the environment to observe queue lengths. The proposed adaptive algorithm was evaluated on eight-queue systems and then extended to twelve-queue systems. The model achieved higher cumulative rewards than previous studies, indicating improved overall performance. The values of the Mean Squared Error (MSE) and Mean Absolute Error (MAE) decreased, suggesting effectively optimized bandwidth allocation. A reduced Root Mean Square Error (RMSE) indicated improved prediction accuracy, and fairness, computed by Jain’s index, was enhanced. The proposed algorithm was validated on real-world network traffic data, ensuring a robust model under dynamic queuing requirements.
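As a schematic of how EWC is typically combined with an actor-critic learner such as SAC, the sketch below adds a quadratic penalty anchored at the parameters consolidated after the previous traffic regime. It is illustrative only; the paper's exact placement of the penalty and its Fisher estimation procedure are not specified in the abstract.

```python
# Illustrative Elastic Weight Consolidation (EWC) regularizer for a SAC actor.
import torch
import torch.nn as nn

class EWCPenalty:
    def __init__(self, model: nn.Module, fisher: dict, lam: float = 100.0):
        # fisher: parameter-name -> diagonal Fisher estimate from the previous
        # queueing regime; anchor parameters are copied at consolidation time.
        self.fisher = fisher
        self.anchor = {n: p.detach().clone() for n, p in model.named_parameters()}
        self.lam = lam

    def __call__(self, model: nn.Module) -> torch.Tensor:
        terms = [
            (self.fisher[n] * (p - self.anchor[n]) ** 2).sum()
            for n, p in model.named_parameters() if n in self.fisher
        ]
        return 0.5 * self.lam * torch.stack(terms).sum()

# During training on a new traffic pattern the penalty is simply added:
#   actor_loss = sac_actor_loss(batch) + ewc(actor)
```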
- Research Article
- 10.30574/wjarr.2025.26.3.2333
- Jun 30, 2025
- World Journal of Advanced Research and Reviews
- Muhammad Faheem + 3 more
Increasingly complex cyberattacks targeting America’s essential infrastructure endanger national security, financial stability and public safety. Rule-based cybersecurity often fails to detect new and evolving threats in real time, leaving major systems exposed. Our research introduces an AI cyber threat detection framework based on autoencoders and LSTM networks that improves both the accuracy and the speed of threat detection. Continual learning and reinforcement learning are built into the system so that it can adapt to new threats in real time. Tests of our system on replayed SCADA logs and the NSL-KDD dataset show highly effective detection. The model’s dependability is confirmed by metrics such as precision, recall and F1-score, and both its edge and cloud deployments provide speed and support for a growing number of devices. SHAP and LIME are used to explain how the AI reaches its decisions. So far the results apply to simulated scenarios; our next step is to deploy the system in real-world settings. The research introduces a resilient, flexible and easily explainable artificial intelligence method to make national critical infrastructure more secure.
- Research Article
- 10.1142/s1793962325500400
- Jun 21, 2025
- International Journal of Modeling, Simulation, and Scientific Computing
- Xingyu Zhou + 2 more
The optimal execution problem has long been an active research topic, and many reinforcement learning (RL) algorithms have been studied for it. In this paper, we address the execution problem of targeting the volume-weighted average price (VWAP) and propose a relaxed stochastic optimization framework with an entropy regularizer to promote exploration. We derive the explicit formula for the optimal policy, which follows a Gaussian distribution whose mean is the solution to the original problem. By extending the framework of continuous RL to processes with jumps, we provide theoretical proofs for the convergence and performance of RL algorithms in this new setting. Additionally, we design two Actor-Critic algorithms, called ML-AC and MO-AC. The first minimizes the martingale loss function, yielding the optimal estimation of critic network parameters in the mean-square sense; the second utilizes the martingale orthogonality condition as an alternative approach. The convergence of both algorithms has been verified across different environments, demonstrating a larger advantage in environments with stronger price impact. RL algorithms do not rely on explicit model assumptions or parameter estimation, enabling them to learn directly from interactions with the environment.
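For readers unfamiliar with the entropy-regularized (exploratory) formulation referred to above, a schematic version of the relaxed objective is shown below. The notation is illustrative, and the paper's jump-process setting adds terms not written here.

```latex
% Relaxed (exploratory) execution objective with Shannon-entropy regularization:
% \pi_t(a) is the randomized trading rate, r the running reward, \lambda > 0 the temperature.
\max_{\pi}\; \mathbb{E}\!\left[\int_0^T\!\!\int_{\mathcal{A}}
  \bigl( r(t, X_t, a) - \lambda \log \pi_t(a) \bigr)\,\pi_t(a)\,\mathrm{d}a\,\mathrm{d}t\right]
% In linear-quadratic settings the maximizer is Gaussian: its mean equals the
% optimal unregularized control and its variance scales with \lambda.
```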
- Research Article
- 10.1080/10255842.2025.2519418
- Jun 14, 2025
- Computer Methods in Biomechanics and Biomedical Engineering
- Elnaz Kalhor + 2 more
The central problem addressed in this paper is the rapid treatment of melanoma, medically known as one of the most malignant types of cancer. The disease can put patients at risk of death if no quick action is taken, and medical experts often face serious challenges in determining the optimal dose. Intelligent methods can pave the way and efficiently assist them in reliably providing the most suitable dose for quick treatment. The RL approach seems to be one of the best candidates, but conventional RL lacks high accuracy and speed, owing to its discrete states and actions, and may result in increased control effort. These drawbacks have led us to adopt continuous RL, a combination of neural networks (NNs) and the RL approach. This increases the accuracy and optimality of the dose in a continuous state space to control and annihilate the population of cancer cells, while the complexity of the approach remains low. According to physicians, treatment of melanoma in its initial stage takes two months, after which cancer cells are completely eliminated from the patient’s body. Accordingly, a mathematical model of a patient with initial-stage melanoma is employed. The proposed method is analyzed in comparison with the Eligibility Traces algorithm, the Q-learning algorithm and a constant-dose injection method. The simulation results indicate that when the combination of the RL approach and NNs is adopted, the cancer cells vanish completely after 50 days, while the other parameters of the considered model remain within their normal ranges. In contrast, when the Eligibility Traces or Q-learning algorithm is employed, cancer cells are still present in the patient’s body after 50 days. With the proposed hybrid method, the injected dose is significantly lower than with the other methods, so the side effects of the drug are reduced. Finally, the effectiveness of the proposed approach is evaluated in five melanoma patients under uncertainty and noise; the results confirm its promising capability to control the population of cancer cells and reach a desired level.
- Research Article
- 10.1111/mice.13503
- May 19, 2025
- Computer-Aided Civil and Infrastructure Engineering
- Jing Chen + 4 more
Abstract Safe, efficient, and comfortable autonomous driving is essential for high‐quality transport service in an open road environment. However, most existing driving strategy learning approaches for autonomous driving struggle with varying driving environments, only working properly under certain scenarios. Therefore, this study proposes a novel hierarchical continual reinforcement learning (RL) framework to abstract various driving patterns as skills and support driving strategy adaptation based on vehicle‐cloud collaboration. The proposed framework leverages skill abstracting in the cloud to learn driving skills from massive demonstrations and store them as deep RL models, mitigating catastrophic forgetting and data imbalance for driving strategy adaptation. Connected autonomous vehicles’ (CAVs) driving strategies are sent to the cloud and continually updated by integrating abstracted driving skills and interactions with parallel environments in the cloud. Then, CAVs receive updated driving strategies from the cloud to interact with the real‐time environment. In the experiment, high‐fidelity and stochastic environments are created using real‐world pavement and traffic data. Experimental results showcase the proposed hierarchical continual RL framework exhibits a 34.04% reduction in potentially hazardous events and a 9.04% improvement in vertical comfort, compared to a classical RL baseline, demonstrating superior driving performance and strong generalization capabilities in varying driving environments. Overall, the proposed framework reinvigorates streaming driving data, prevailing motion planning models, and cloud computation resources for life‐long driving strategy learning.
- Research Article
- 10.3390/computers14050172
- May 2, 2025
- Computers
- George Balaskas + 4 more
This paper introduces a novel framework for addressing domain adaptation challenges in large language models (LLMs), emphasising privacy-preserving synthetic data generation and efficient fine-tuning. The proposed framework employs a multi-stage approach that includes document ingestion, relevance assessment, and automated dataset creation. This process reduces the need for extensive technical expertise while safeguarding data privacy. We evaluate the framework’s performance on domain-specific tasks in fields such as biobanking and public health, demonstrating that models fine-tuned using our method achieve results comparable to larger proprietary models. Crucially, these models maintain their general instruction-following capabilities, even when adapted to specialised domains, as shown through experiments with 7B and 8B parameter LLMs. Key components of the framework include continuous pre-training, supervised fine-tuning (SFT), and reinforcement learning methods such as direct preference optimisation (DPO), which together provide a flexible and configurable solution for deploying LLMs. The framework supports both local models and API-based solutions, making it scalable and accessible. By enabling privacy-preserving, domain-specific adaptation without requiring extensive expertise, this framework represents a significant step forward in the deployment of LLMs for specialised applications. The framework significantly lowers the barrier to domain adaptation for small- and medium-sized enterprises (SMEs), enabling them to utilise the power of LLMs without requiring extensive resources or technical expertise.
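Of the training stages listed above, direct preference optimisation is the most self-contained; a minimal sketch of its preference loss is given below. This is illustrative only, since the abstract does not specify the framework's actual training stack or hyperparameters.

```python
# Schematic DPO loss: push the policy's log-probability ratio for the
# preferred response above that of the rejected one, relative to a frozen
# reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor, policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor, ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```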
- Research Article
- 10.1016/j.aei.2025.103297
- May 1, 2025
- Advanced Engineering Informatics
- Haoze Wu + 5 more
Continual contrastive reinforcement learning: Towards stronger agent for environment-aware fault diagnosis of aero-engines through long-term optimization under highly imbalance scenarios
- Research Article
- 10.3390/sym17050638
- Apr 23, 2025
- Symmetry
- Chayoung Kim
Stable Q-value estimation is critical for effective policy learning in deep reinforcement learning (DRL), especially continuous control tasks. Traditional algorithms like Soft Actor-Critic (SAC) and Twin Delayed Deep Deterministic (TD3) policy gradients rely on Mean Squared Error (MSE) loss for Q-value approximation, which may cause instability due to misestimation and overestimation biases. Although distributional reinforcement learning (RL) algorithms like C51 have improved robustness in discrete action spaces, their application to continuous control remains computationally expensive owing to distribution projection needs. To address this, we propose a classification-based Q-value learning method that reformulates Q-value estimation as a classification problem rather than a regression task. Replacing MSE loss with cross-entropy (CE) and Kullback–Leibler (KL) divergence losses, the proposed method improves learning stability and mitigates overestimation errors. Our statistical analysis across 30 independent runs shows that the approach achieves an approximately 10% lower Q-value estimation error in the pendulum environment and a 40–60% reduced training time compared to SAC and Continuous Twin Delayed Distributed Deep Deterministic (CTD4) Policy Gradient. Experimental results on OpenAI Gym benchmark environments demonstrate that our approach, with up to 77% fewer parameters, outperforms the SAC and CTD4 policy gradients regarding training stability and convergence speed, while maintaining a competitive final policy performance.
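A minimal sketch of the classification view of Q-learning described above follows. It assumes a fixed value range discretized into bins and a "two-hot" target encoding; the paper's exact target construction and its KL-divergence variant may differ.

```python
# Q-value estimation as classification: the critic outputs logits over value
# bins and is trained with cross-entropy against a two-hot TD target,
# replacing the usual MSE regression loss.
import torch
import torch.nn.functional as F

V_MIN, V_MAX, N_BINS = -100.0, 0.0, 51
SUPPORT = torch.linspace(V_MIN, V_MAX, N_BINS)  # bin centres

def two_hot(target_q: torch.Tensor) -> torch.Tensor:
    """Encode scalar targets as a distribution over the two nearest bins."""
    target_q = target_q.clamp(V_MIN, V_MAX)
    pos = (target_q - V_MIN) / (V_MAX - V_MIN) * (N_BINS - 1)
    lower = pos.floor().long().clamp(max=N_BINS - 2)
    upper = lower + 1
    frac = (pos - lower.float()).unsqueeze(-1)
    dist = torch.zeros(*target_q.shape, N_BINS)
    dist.scatter_(-1, lower.unsqueeze(-1), 1.0 - frac)
    dist.scatter_(-1, upper.unsqueeze(-1), frac)
    return dist

def classification_q_loss(q_logits: torch.Tensor, target_q: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between predicted bin logits and the two-hot TD target."""
    target_dist = two_hot(target_q)
    return -(target_dist * F.log_softmax(q_logits, dim=-1)).sum(-1).mean()

# A scalar Q estimate, if needed for the actor update, is the expectation:
#   q_value = (F.softmax(q_logits, -1) * SUPPORT).sum(-1)
```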
- Research Article
- 10.1609/aaai.v39i28.35251
- Apr 11, 2025
- Proceedings of the AAAI Conference on Artificial Intelligence
- Tadeusz Dziarmaga + 3 more
Continual reinforcement learning (CRL) is the study of optimal strategies for maximizing rewards in sequential environments that change over time. This is particularly crucial in domains such as robotics, where the operational environment is inherently dynamic and subject to continual change. Nevertheless, research in this area has thus far concentrated on off-policy algorithms with replay buffers that are capable of amortizing the impact of distribution shifts. Such an approach is not feasible with on-policy reinforcement learning algorithms that learn solely from the data obtained from the current policy. In this paper, we examine the performance of proximal policy optimization (PPO), a prevalent on-policy reinforcement learning (RL) algorithm, in a classical CRL benchmark. Our findings suggest that the current methods are suboptimal in terms of average performance. Nevertheless, they demonstrate encouraging competitive outcomes with respect to forward transfer and forgetting metrics. This highlights the need for further research into continual on-policy reinforcement learning. The source code is available at https://github.com/Teddy298/continualworld-ppo.
- Research Article
- 10.1609/aaai.v39i18.34163
- Apr 11, 2025
- Proceedings of the AAAI Conference on Artificial Intelligence
- Rashmeet Kaur Nayyar + 1 more
Abstraction is key to scaling up reinforcement learning (RL). However, autonomously learning abstract state and action representations to enable transfer and generalization remains a challenging open problem. This paper presents a novel approach for inventing, representing, and utilizing options, which represent temporally extended behaviors, in continual RL settings. Our approach addresses streams of stochastic problems characterized by long horizons, sparse rewards, and unknown transition and reward functions. Our approach continually learns and maintains an interpretable state abstraction, and uses it to invent high-level options with abstract symbolic representations. These options meet three key desiderata: (1) composability for solving tasks effectively with lookahead planning, (2) reusability across problem instances for minimizing the need for relearning, and (3) mutual independence for reducing interference among options. Our main contributions are approaches for continually learning transferable, generalizable options with symbolic representations, and for integrating search techniques with RL to efficiently plan over these learned options to solve new problems. Empirical results demonstrate that the resulting approach effectively learns and transfers abstract knowledge across problem instances, achieving superior sample efficiency compared to state-of-the-art methods.
- Research Article
- 10.1609/aaai.v39i24.34799
- Apr 11, 2025
- Proceedings of the AAAI Conference on Artificial Intelligence
- Lulu Zhao + 3 more
Recently, both closed-source and open-source LLMs have made significant strides, outperforming humans in various general domains. However, their performance in specific professional domains such as medicine, especially within the open-source community, remains suboptimal due to the complexity of medical knowledge. In this paper, we propose CareBot, a bilingual medical LLM, which leverages a comprehensive approach integrating continuous pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning with human feedback (RLHF). Our novel two-stage CPT method, comprising Stable CPT and Boost CPT, effectively bridges the gap between general and domain-specific data, facilitating a smooth transition from pre-training to fine-tuning and enhancing domain knowledge progressively. We also introduce DataRater, a model designed to assess data quality during CPT, ensuring that the training data is both accurate and relevant. For SFT, we develop a large and diverse bilingual dataset, along with ConFilter, a metric to enhance multi-turn dialogue quality, which is crucial to improving the model's ability to handle more complex dialogues. The combination of high-quality data sources and innovative techniques significantly improves CareBot's performance across a range of medical applications. Our rigorous evaluations on Chinese and English benchmarks confirm CareBot's effectiveness in medical consultation and education. These advancements not only address current limitations in medical LLMs but also set a new standard for developing effective and reliable open-source models in the medical domain.
- Research Article
- 10.1109/tnnls.2024.3387871
- Apr 1, 2025
- IEEE Transactions on Neural Networks and Learning Systems
- Alireza Ramezani Moghaddam + 1 more
In this article, we investigate the Nash-seeking problem of a set of agents, playing an infinite network aggregative Markov game. In particular, we focus on a noncooperative framework where each agent selfishly aims at maximizing its long-term average reward without having explicit information on the model of the environment dynamics and its own reward function. The main contribution of this article is to develop a continuous multiagent reinforcement learning (MARL) algorithm for the Nash-seeking problem in infinite dynamic games with convergence guarantee. To this end, we propose an actor-critic MARL algorithm based on expected policy gradient (EPG) with two general function approximators to estimate the value function and the Nash policy of the agents. We consider continuous state and action spaces and adopt a newly proposed EPG to alleviate the variance of the gradient approximation. Based on such formulation and under some conventional assumptions (e.g., using linear function approximators), we prove that the policies of the agents converge to the unique Nash equilibrium (NE) of the game. Furthermore, an estimation error analysis is conducted to investigate the effects of the error arising from function approximation. As a case study, the framework is applied on a cloud radio access network (C-RAN) by modeling the remote radio heads (RRHs) as the agents and the congestion of baseband units (BBUs) as the dynamics of the environment.