Reinforcement learning for medical image analysis: a systematic review of algorithms, engineering challenges, and clinical deployment
Reinforcement learning (RL) has emerged as a powerful artificial intelligence paradigm in medical image analysis, excelling in complex decision-making tasks. This systematic review synthesizes the applications of RL across diverse imaging domains—including landmark detection, image segmentation, lesion identification, disease diagnosis, and image registration—by analyzing 20 peer-reviewed studies published between 2019 and 2023. RL methods are categorized into classical and deep reinforcement learning (DRL) approaches, focusing on their performance, integration with other machine learning models, and clinical utility. Deep Q-Networks (DQN) demonstrated strong performance in anatomical landmark detection and cardiovascular risk estimation, while Proximal Policy Optimization (PPO) and Advantage Actor-Critic (A2C) achieved optimal policy learning for vessel tracking. Policy gradient methods such as REINFORCE, Twin-Delayed Deep Deterministic Policy Gradient (TD3), and Soft Actor-Critic (SAC) were successfully applied to breast lesion detection, white-matter connectivity analysis, and vertebral segmentation. Monte Carlo learning, meta-RL, and A3C methods proved effective for adaptive questioning, image quality evaluation, and multimodal image registration. To consolidate these findings, we propose a unified Reinforcement Learning Medical Imaging (RLMI) framework encompassing four core components: state representation, policy optimization, reward formulation, and environment modeling. This framework enhances sequential agent learning, stabilizes navigation, and generalizes across imaging modalities and tasks. Key challenges remain, including optimizing task-specific policies, integrating anatomical contexts, addressing data scarcity, and improving interpretability.
This review highlights RL’s potential to enhance accuracy, adaptability, and efficiency in medical image analysis, providing valuable guidance for researchers and clinicians applying RL in real-world healthcare settings.
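The four components named in the abstract above (state representation, policy optimization, reward formulation, environment modeling) map onto the standard agent-environment loop of reinforcement learning. The sketch below is a toy stand-in, not the RLMI framework itself: a hypothetical 1-D landmark-search environment with a tabular Q-learning policy, invented here purely to illustrate how the four components fit together.

```python
import random

# Toy stand-ins for the four components (illustrative only):
# - environment: a 1-D image axis with a hidden landmark at index 13
# - state: the agent's current coordinate
# - reward: +1 for stepping closer to the landmark, -1 otherwise
# - policy: epsilon-greedy over a tabular action-value estimate

SIZE, LANDMARK = 20, 13
ACTIONS = (-1, +1)  # move left / move right

def step(state, action):
    nxt = min(max(state + action, 0), SIZE - 1)
    reward = 1.0 if abs(nxt - LANDMARK) < abs(state - LANDMARK) else -1.0
    return nxt, reward, nxt == LANDMARK

def train(episodes=500, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(SIZE) for a in ACTIONS}
    for _ in range(episodes):
        s = rng.randrange(SIZE)
        for _ in range(2 * SIZE):
            if rng.random() < eps:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda x: q[(s, x)])
            nxt, r, done = step(s, a)
            best_next = max(q[(nxt, b)] for b in ACTIONS)
            q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])
            s = nxt
            if done:
                break
    return q

if __name__ == "__main__":
    q = train()
    s = 5  # greedy rollout should walk right to the landmark
    for _ in range(SIZE):
        if s == LANDMARK:
            break
        s, _, _ = step(s, max(ACTIONS, key=lambda a: q[(s, a)]))
    print(s)
```

Real landmark-detection agents operate on 2-D or 3-D image patches with deep Q-networks, but the loop structure is the same.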
- Research Article
10
- 10.1016/j.comnet.2024.110670
- Jul 23, 2024
- Computer Networks
Next-gen resource optimization in NB-IoT networks: Harnessing soft actor–critic reinforcement learning
- Research Article
44
- 10.3390/drones7040245
- Apr 1, 2023
- Drones
Unmanned Aerial Vehicles (UAVs), also known as drones, have advanced greatly in recent years. Drones can be used in many ways, including transportation, photography, climate monitoring, and disaster relief, largely because of their efficiency and safety in operation. Although drone design continues to improve, it is not yet flawless: drones still face many challenges in detecting and preventing collisions. In this context, this paper describes a methodology for developing a drone system that operates autonomously without the need for human intervention. This study applies reinforcement learning algorithms to train a drone to avoid obstacles autonomously in discrete and continuous action spaces based solely on image data. The novelty of this study lies in its comprehensive assessment of the advantages, limitations, and future research directions of obstacle detection and avoidance for drones, using different reinforcement learning techniques. This study compares three different reinforcement learning strategies—namely, Deep Q-Networks (DQN), Proximal Policy Optimization (PPO), and Soft Actor-Critic (SAC)—in helping drones avoid both stationary and moving obstacles. The experiments were carried out in a virtual environment made available by AirSim. Using Unreal Engine 4, various training and testing scenarios were created for understanding and analyzing the behavior of RL algorithms for drones. According to the training results, SAC outperformed the other two algorithms. PPO was the least successful, indicating that on-policy algorithms are ineffective in extensive 3D environments with dynamic actors. DQN and SAC, two off-policy algorithms, produced encouraging outcomes. However, due to its constrained discrete action space, DQN may not be as advantageous as SAC in narrow pathways and twists.
A further finding is that for autonomous drones, off-policy algorithms such as DQN and SAC perform more effectively than on-policy algorithms such as PPO. These findings could have practical implications for the development of safer and more efficient drones in the future.
- Research Article
1
- 10.54254/2755-2721/2025.tj23321
- May 22, 2025
- Applied and Computational Engineering
Policy gradient (PG) methods are a fundamental component of deep reinforcement learning (DRL), particularly effective in continuous and high-dimensional control tasks. This paper presents a structured review of PG algorithms, tracing their development from basic Monte Carlo methods like REINFORCE to advanced techniques such as asynchronous advantage actor-critic (A3C), trust region policy optimization (TRPO), proximal policy optimization (PPO), deep deterministic policy gradient (DDPG), and soft actor-critic (SAC). These methods differ in terms of policy structure, optimization stability, and sample efficiency, addressing core challenges in policy learning through gradient-based updates. In addition, this review explores the application of PG methods in real-world domains, including autonomous driving, financial portfolio management, and smart grid energy systems. These applications demonstrate PG methods' capacity to operate under uncertainty and adapt to complex dynamic environments. However, limitations such as high variance, low sample efficiency, and instability in multi-agent and offline settings remain significant obstacles. The review concludes by outlining emerging research directions, including entropy-based exploration, model-based policy optimization, meta-learning, and Transformer-based sequence modeling. This work aims to offer theoretical insights and practical guidance to support the continued advancement and application of policy gradient methods in reinforcement learning.
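The core mechanism this review traces back to REINFORCE can be shown in a few lines. The sketch below is a minimal Monte Carlo policy-gradient update on a hypothetical two-armed bandit: the payoff probabilities and learning rate are invented for illustration, and the policy is a softmax over two action preferences.

```python
import math
import random

# Minimal REINFORCE sketch on a hypothetical two-armed bandit.
# For a softmax policy, the gradient of log pi(a) with respect to
# preference h_i is (1 - pi(i)) when i == a, and -pi(i) otherwise.

PAYOFF = {0: 0.2, 1: 0.8}  # invented expected reward of each arm

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce(episodes=2000, lr=0.1, seed=0):
    rng = random.Random(seed)
    prefs = [0.0, 0.0]
    for _ in range(episodes):
        probs = softmax(prefs)
        a = 0 if rng.random() < probs[0] else 1
        r = 1.0 if rng.random() < PAYOFF[a] else 0.0  # sampled return
        # gradient ascent on log-likelihood, scaled by the return
        for i in range(2):
            grad = (1.0 - probs[i]) if i == a else -probs[i]
            prefs[i] += lr * r * grad
    return softmax(prefs)

if __name__ == "__main__":
    print(reinforce())  # probability mass shifts toward the better arm
```

The high variance of this estimator is exactly what the actor-critic methods in the review (A3C, PPO, SAC) address by adding baselines, trust regions, or entropy terms.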
- Book Chapter
2
- 10.1007/978-3-030-95502-1_27
- Jan 1, 2022
In recent years, Deep Reinforcement Learning (DRL) has been extensively used to solve problems in various domains like traffic control, healthcare, and simulation-based training. Proximal Policy Optimization (PPO) and Soft-Actor Critic (SAC) are among DRL's latest state-of-the-art on-policy and off-policy algorithms, respectively. Though previous studies have shown that SAC generally performs better than PPO, hyperparameter tuning can significantly impact the performance of these algorithms. Also, a systematic evaluation of the efficacy of these algorithms after hyperparameter tuning in dynamic and complex environments is missing and much needed in the literature. This research aims to evaluate the effect of the number of layers and nodes in SAC and PPO algorithms in a search-and-retrieve task developed in the Unity 3D game engine. In the task, a bot had to navigate through the physical mesh and collect ‘target’ objects while avoiding ‘distractor’ objects. We compared the SAC and PPO models on four different test conditions that differed in the ratios of targets and distractors. Results revealed that PPO performed better than SAC for all test conditions when the number of layers and units present in the architecture was the lowest. When the number of targets was more than the distractors (9:1), PPO outperformed SAC, especially when the number of units and layers was large. Furthermore, increasing the layers and units per layer increased both PPO and SAC performance. Results also implied that similar hyperparameter settings should be used when comparing models developed using DRL algorithms. We discuss the implications of these results and explore the possible applications of using modern, state-of-the-art DRL algorithms to learn the semantics and idiosyncrasies associated with complex and dynamic environments. Keywords: Proximal Policy Optimization; Soft-Actor Critic; Deep reinforcement learning; Virtual environments; Unity3D
- Supplementary Content
60
- 10.1002/acm2.13898
- Jan 10, 2023
- Journal of Applied Clinical Medical Physics
Motivation: Medical image analysis involves a series of tasks used to assist physicians in qualitative and quantitative analyses of lesions or anatomical structures, which can significantly improve the accuracy and reliability of medical diagnoses and prognoses. Traditionally, these tedious tasks were performed by experienced physicians or medical physicists and were marred by two major problems: low efficiency and bias. In the past decade, many machine learning methods have been applied to accelerate and automate the image analysis process. Compared to the enormous deployments of supervised and unsupervised learning models, attempts to use reinforcement learning in medical image analysis are still scarce. We hope that this review article could serve as the stepping stone for related research in the future. Significance: We found that although reinforcement learning has gradually gained momentum in recent years, many researchers in the medical analysis field still find it hard to understand and deploy in clinical settings. One possible cause is a lack of well‐organized review articles intended for readers without professional computer science backgrounds. Rather than provide a comprehensive list of all reinforcement learning models applied in medical image analysis, the aim of this review is to help readers formulate and solve their medical image analysis research through the lens of reinforcement learning. Approach & Results: We selected published articles from Google Scholar and PubMed. Considering the scarcity of related articles, we also included some outstanding recent preprints. The papers were carefully reviewed and categorized according to the type of image analysis task. In this article, we first reviewed the basic concepts and popular models of reinforcement learning. Then, we explored the applications of reinforcement learning models in medical image analysis.
Finally, we concluded the article by discussing the reviewed reinforcement learning approaches’ limitations and possible future improvements.
- Research Article
11
- 10.1016/j.neucom.2023.127145
- Dec 11, 2023
- Neurocomputing
Reinforcement learning for Hybrid Disassembly Line Balancing Problems
- Research Article
11
- 10.2215/cjn.0000000000000084
- Feb 8, 2023
- Clinical Journal of the American Society of Nephrology
Introduction: Reinforcement learning formalizes the concept of learning from interactions.1 Broadly, reinforcement learning focuses on a setting in which an agent (decision maker) sequentially interacts with an environment that is partially unknown to them. At each stage, the agent takes an action and receives a reward. The objective of the agent is to maximize rewards accumulated in the long run. There are many situations in health care where decisions are made sequentially for which reinforcement learning approaches could prove useful for decision making. Throughout this article, we consider treatment prescription as an archetypical example to connect reinforcement learning concepts to a health care setting. In this setting, the care provider, the prescribed treatment, and the patients can be viewed as the agent, the action, and the environment, respectively, as depicted in Figure 1 (Figure 1: Sequential treatment of AKI or CKD complications modeled as a reinforcement learning problem). Background: In this section, with the objective of making reinforcement learning literature more accessible to a clinical audience, we briefly introduce related fundamental concepts and approaches. We refer the interested reader to Sutton and Barto1 for a comprehensive introduction to reinforcement learning. Markov Decision Processes: Markov decision processes (MDPs) are a formalism of the sequential decision-making problem that has been central to the theoretical and practical advancements of reinforcement learning. In each stage of an MDP, the agent observes the state of the environment and takes an action, which, in turn, results in a change of the state. This change of state is assumed to be probabilistic, with the next state being determined only by the preceding state, the chosen action, and the transition probability. The agent also receives a reward that is a function of the taken action, the preceding state, and the subsequent state.
In an MDP, the objective of the agent is to maximize the return, defined as the reward accumulated over a time horizon. In some applications, it is common to consider the horizon to be infinite, in which case future rewards are discounted by a factor smaller than one. The selection of an action by the agent on the basis of the observed state is known as the policy. More formally, a policy is a probabilistic mapping from states to each possible action. Because the policy and the reward are a function of the state, it is critical to estimate the utility of being in a certain state. More specifically, the value function is defined as the expected return starting from a given state under the chosen policy. Under this formalism, the objective of the agent is to find the optimal policy that maximizes the value function for all states. Reinforcement Learning Methods: Action-value methods are a class of reinforcement learning methods in which actions are chosen on the basis of the estimation of their long-term value. A prominent example of an action-value method is Q-learning, in which the agent iteratively takes actions with the highest estimated values and updates the action-state value function on the basis of new observations. Policy gradient methods are another class of reinforcement learning methods that seek to optimize the policy directly instead of choosing actions on the basis of their respective estimated values. Such methods could be advantageous in health care applications that entail a large number of possible actions, e.g., when recommending a wide range of drug dosages or treatment options. Clinical Applications: Reinforcement learning frameworks and methods are broadly applicable to clinical settings in which decisions are made sequentially.
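The MDP machinery just described (states, actions, transition probabilities, rewards, discounting, and the value function) can be made concrete with a short value-iteration sketch. The two-state "treatment" MDP below is entirely hypothetical: its states, transition probabilities, and rewards are invented for illustration and carry no clinical meaning.

```python
# Value iteration on a hypothetical two-state treatment MDP
# (states, actions, probabilities, and rewards are invented).
# P[state][action] = list of (probability, next_state, reward)
P = {
    "stable": {
        "wait":  [(0.9, "stable", 1.0), (0.1, "acute", -1.0)],
        "treat": [(0.95, "stable", 0.5), (0.05, "acute", -1.0)],
    },
    "acute": {
        "wait":  [(0.2, "stable", 0.0), (0.8, "acute", -2.0)],
        "treat": [(0.7, "stable", 0.0), (0.3, "acute", -2.0)],
    },
}

def value_iteration(gamma=0.9, tol=1e-8):
    """Iterate the Bellman optimality update until the value function converges."""
    v = {s: 0.0 for s in P}
    while True:
        v_new = {
            s: max(
                sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            for s in P
        }
        if max(abs(v_new[s] - v[s]) for s in P) < tol:
            return v_new
        v = v_new

def greedy_policy(v, gamma=0.9):
    """The optimal policy picks the action maximizing expected return per state."""
    return {
        s: max(P[s], key=lambda a: sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a]))
        for s in P
    }

if __name__ == "__main__":
    v = value_iteration()
    print(v, greedy_policy(v))
```

Here the optimal policy treats in the acute state (raising the chance of returning to the higher-valued stable state) and waits in the stable state, which mirrors how a policy trades off immediate reward against discounted future value.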
A prominent clinical application of reinforcement learning is treatment recommendation, which has been studied across a variety of diseases and treatments, including radiation and chemotherapy for cancer, brain stimulation for epilepsy, and treatment strategies for sepsis.2–5 In such treatment recommendation settings, a policy is commonly known as a dynamic treatment regime. There are various other clinical applications of reinforcement learning, including diagnosis, medical imaging, and decision support tools (see refs. 2–5 and the references therein). Reinforcement Learning in Nephrology: Although there have been recent applications of machine learning in nephrology,6,7 to the best of the authors' knowledge, the application of reinforcement learning to nephrology has been primarily limited to optimizing the erythropoietin dosage in hemodialysis patients.8,9 However, there are other settings where reinforcement learning has the potential to improve patient care in nephrology. For example, reinforcement learning methods can be adopted in the treatment of the complications of AKI or CKD (Figure 1). In this problem, the state models the conditions of the patient (e.g., vital signs, laboratory test results including urine and blood tests, and urine output measurements). The action refers to the treatment options (e.g., the dosage of medications such as sodium polystyrene sulfonate, and hemodialysis). The reward models the improvement in patient conditions. Similarly, reinforcement learning can help automate and optimize the dosage of immunosuppressive drugs in kidney transplants. Challenges and Opportunities: Despite the success of reinforcement learning in several simplified clinical settings, its large-scale application to patient care faces several open challenges. The complexity of human biology complicates modeling clinical decision making as a reinforcement learning problem.
The state space in such settings is often enormous, which could make a purely computational approach infeasible. Moreover, modeling all potential objectives a priori as a reward function may not be feasible. To overcome these challenges and realize the potential of reinforcement learning, clinical insight can play a pivotal role. More specifically, restricting the state space to only include highly relevant clinical variables could greatly reduce the computational complexity. Furthermore, using inverse reinforcement learning,2 relevant reward functions can be learned from retrospective studies assuming the optimality of clinical decisions. Another critical challenge is addressing moral and ethical concerns. It is imperative to ensure that reinforcement learning methods do not cause harm to the patient. To this end, there exists a need for a thorough validation of such methods before their use in patient care. Hence, there is a need to go beyond retrospective studies that have been used for the proof of concept of most existing reinforcement learning methods in health care applications.2,3 The lessons learned from the success of reinforcement learning in other application areas (e.g., self-driving cars) can help navigate the path to realizing its potential in health care. Accessible open-source simulation environments that enable researchers to compare various approaches are essential to the field of reinforcement learning. OpenAI Gym is currently the leading toolkit containing a wide range of simulated environments, e.g., surgical robotics.10 The development of high-quality and reliable simulation environments for nephrology and other health care applications can facilitate the development and validation of reinforcement learning methods beyond limited retrospective studies. The adoption of methods validated in such simulation environments in actual clinical settings will require clinicians' oversight. 
Similar to how self-driving cars require a human driver to ensure collision avoidance, clinicians' oversight is critical to ensure the safety of the patients, especially in the early stages of the adoption of reinforcement learning methods. The data from clinicians' decisions (e.g., overruling the automated treatment recommendation) can be used to improve the reliability of autonomous systems over time and reduce the burden of clinicians' oversight.
- Abstract
- 10.1182/blood-2023-178306
- Nov 2, 2023
- Blood
Deep Reinforcement Learning for Managing Platelets in a Hospital Blood Bank
- Research Article
46
- 10.1016/j.oceaneng.2022.111882
- Jul 11, 2022
- Ocean Engineering
Marine route optimization using reinforcement learning approach to reduce fuel consumption and consequently minimize CO2 emissions
- Research Article
179
- 10.1016/j.media.2019.02.007
- Feb 14, 2019
- Medical Image Analysis
Evaluating reinforcement learning agents for anatomical landmark detection
- Research Article
1
- 10.3390/a18020106
- Feb 15, 2025
- Algorithms
In the field of gaming artificial intelligence, selecting the appropriate machine learning approach is essential for improving decision-making and automation. This paper examines the effectiveness of deep reinforcement learning (DRL) within interactive gaming environments, focusing on complex decision-making tasks. Utilizing the Unity engine, we conducted experiments to evaluate DRL methodologies in simulating realistic and adaptive agent behavior. A vehicle driving game is implemented, in which the goal is to reach a certain target within a small number of steps while respecting the boundaries of the roads. Our study compares Proximal Policy Optimization (PPO) and Soft Actor–Critic (SAC) in terms of learning efficiency, decision-making accuracy, and adaptability. The results demonstrate that PPO successfully learns to reach the target, achieving higher and more stable cumulative rewards. Conversely, SAC struggles to reach the target, displaying significant variability and lower performance. These findings highlight the effectiveness of PPO in this context and indicate the need for further development, adaptation, and tuning of SAC. This research contributes innovative approaches to using machine learning to improve how player agents adapt and react to their environments, thereby enhancing realism and dynamics in gaming experiences. Additionally, this work emphasizes the utility of using games to evolve such models, preparing them for real-world applications, namely autonomous driving and optimal route calculation.
- Research Article
10
- 10.3390/app13010633
- Jan 3, 2023
- Applied Sciences
There are several automated stock trading programs using reinforcement learning, one of which is an ensemble strategy. The main idea of the ensemble strategy is to train DRL agents and form an ensemble with three different actor–critic algorithms: Advantage Actor–Critic (A2C), Deep Deterministic Policy Gradient (DDPG), and Proximal Policy Optimization (PPO). This idea serves as the starting point for this paper, which refines automated stock trading in two areas. First, we built another DRL-based ensemble and employed it as a new trading agent. Named Remake Ensemble, it combines not only A2C, DDPG, and PPO but also Actor–Critic using Kronecker-Factored Trust Region (ACKTR), Soft Actor–Critic (SAC), Twin Delayed DDPG (TD3), and Trust Region Policy Optimization (TRPO). Second, we expanded the application domain of automated stock trading: whereas the existing method handles only 30 Dow Jones stocks, ours handles KOSPI stocks, JPX stocks, and Dow Jones stocks. We conducted experiments with our modified automated stock trading system to validate its robustness in terms of cumulative return. Finally, based on the experiments, we suggest methods for earning relatively stable profits.
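The selection step behind such an ensemble can be sketched as a validation-window chooser: after each validation window, deploy the agent whose simulated returns score best. The sketch below uses the Sharpe ratio as one common selection criterion; the agent names and per-step return series are hypothetical, not results from the paper.

```python
import statistics

# Sketch of validation-window selection for an ensemble of DRL trading
# agents (agent names and return series below are hypothetical).

def sharpe(returns):
    """Per-window Sharpe ratio: mean return over its standard deviation."""
    mu = statistics.mean(returns)
    sd = statistics.pstdev(returns)
    return mu / sd if sd > 0 else float("-inf")

def select_agent(validation_returns):
    """validation_returns: {agent_name: [per-step return, ...]}.
    Returns the name of the agent with the best risk-adjusted score."""
    return max(validation_returns, key=lambda k: sharpe(validation_returns[k]))

if __name__ == "__main__":
    window = {
        "A2C":  [0.010, -0.020, 0.015, 0.000],
        "PPO":  [0.012, 0.008, 0.010, 0.009],
        "DDPG": [0.030, -0.040, 0.050, -0.020],
    }
    # PPO wins here: modest mean return but far lower volatility
    print(select_agent(window))
```

Repeating this choice per window is what lets the ensemble fall back on whichever algorithm is currently behaving best, rather than committing to one agent for the whole trading period.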
- Research Article
57
- 10.1016/j.apenergy.2022.120113
- Nov 1, 2022
- Applied Energy
Deep reinforcement learning based optimization for a tightly coupled nuclear renewable integrated energy system
- Research Article
38
- 10.3390/buildings12020131
- Jan 27, 2022
- Buildings
When deep reinforcement learning (DRL) methods are applied in energy consumption prediction, performance is usually improved at the cost of the increasing computation time. Specifically, the deep deterministic policy gradient (DDPG) method can achieve higher prediction accuracy than deep Q-network (DQN), but it requires more computing resources and computation time. In this paper, we proposed a deep-forest-based DQN (DF–DQN) method, which can obtain higher prediction accuracy than DDPG and take less computation time than DQN. Firstly, the original action space is replaced with the shrunken action space to efficiently find the optimal action. Secondly, deep forest (DF) is introduced to map the shrunken action space to a single sub-action space. This process can determine the specific meaning of each action in the shrunken action space to ensure the convergence of DF–DQN. Thirdly, state class probabilities obtained by DF are employed to construct new states by considering the probabilistic process of shrinking the original action space. The experimental results show that the DF–DQN method with 15 state classes outperforms other methods and takes less computation time than DRL methods. MAE, MAPE, and RMSE are decreased by 5.5%, 7.3%, and 8.9% respectively, and R2 is increased by 0.3% compared to the DDPG method.
- Research Article
10
- 10.3389/frspt.2023.1263489
- Nov 29, 2023
- Frontiers in Space Technologies
Deep reinforcement learning (DRL) has shown promise for spacecraft planning and scheduling due to the lack of constraints on model representation, the ability of trained policies to achieve optimal performance with respect to a reward function, and fast execution times of the policies after training. Past work investigates various problem formulations, algorithms, and safety methodologies, but a comprehensive comparison between different DRL methods and problem formulations has not been performed for spacecraft scheduling problems. This work formulates two Earth-observing satellite (EOS) scheduling problems with resource constraints regarding power, reaction wheel speeds, and on-board data storage. The environments provide both simple and complex scheduling challenges for benchmarking DRL performance. Policy gradient and value-based reinforcement learning algorithms are trained for each environment and are compared on the basis of performance, performance variance between different seeds, and wall clock time. Advantage actor-critic (A2C), deep Q-networks (DQN), proximal policy optimization (PPO), shielded proximal policy optimization (SPPO) and a Monte Carlo tree search based training-pipeline (MCTS-Train) are applied to each EOS scheduling problem. Hyperparameter tuning is performed for each method, and the best performing hyperparameters are selected for comparison. Each DRL algorithm is also compared to a genetic algorithm, which provides a point of comparison outside the field of DRL. PPO and SPPO are shown to be the most stable algorithms, converging quickly to high-performing policies between different experiments. A2C and DQN are typically able to produce high-performing policies, but with relatively high variance across the selected hyperparameters. MCTS-Train is capable of producing high-performing policies for most problems, but struggles when long planning horizons are utilized. 
The results of this work provide a basis for selecting reinforcement learning algorithms for spacecraft planning and scheduling problems. The algorithms and environments used in this work are provided in a Python package called bsk_rl to facilitate future research in this area.