A Review of Reinforcement Learning in Financial Applications

Abstract

In recent years, there has been a growing trend of applying reinforcement learning (RL) in financial applications. This approach has shown great potential for decision-making tasks in finance. In this review, we present a comprehensive study of the applications of RL in finance and conduct a series of meta-analyses to investigate the common themes in the literature, such as the factors that most significantly affect RL's performance compared with traditional methods. Moreover, we identify challenges, including explainability, Markov decision process modeling, and robustness, that hinder the broader utilization of RL in the financial industry and discuss recent advancements in overcoming these challenges. Finally, we propose future research directions, such as benchmarking, contextual RL, multi-agent RL, and model-based RL, to address these challenges and to further enhance the implementation of RL in finance.
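One of the challenges named above, MDP modeling, can be made concrete with a small example. The following is a minimal sketch, with entirely invented states, rewards, and dynamics, of how a trading-style decision task might be cast as a tabular MDP and solved with Q-learning:

```python
import random

random.seed(0)
STATES, ACTIONS = 3, 3   # state: price trend (0=down, 1=flat, 2=up); action: 0=sell, 1=hold, 2=buy
Q = [[0.0] * ACTIONS for _ in range(STATES)]
alpha, gamma, eps = 0.1, 0.5, 0.1

def step(s, a):
    # Invented dynamics: the trend persists with probability 0.7, otherwise it
    # jumps uniformly; rewards favour selling in a down-trend and buying in an up-trend.
    s2 = s if random.random() < 0.7 else random.randrange(STATES)
    r = [[1.0, 0.0, -1.0], [0.0, 0.1, 0.0], [-1.0, 0.0, 1.0]][s][a]
    return s2, r

s = 1
for _ in range(20000):
    a = random.randrange(ACTIONS) if random.random() < eps else max(range(ACTIONS), key=lambda x: Q[s][x])
    s2, r = step(s, a)
    Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])   # Q-learning update
    s = s2

greedy = [max(range(ACTIONS), key=lambda a: Q[st][a]) for st in range(STATES)]
print(greedy)   # expected to prefer selling in a down-trend and buying in an up-trend
```

Because the toy dynamics are independent of the chosen action, the learned greedy policy simply tracks the immediate reward structure; real financial MDPs are far less forgiving, which is precisely the modeling challenge discussed in the review.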

Similar Papers
  • Conference Article
  • Cited by 7
  • 10.1109/reepe57272.2023.10086785
MARLMUI: Multi-Agent Reinforcement Learning Approach in Mobile Adaptive User Interface
  • Mar 16, 2023
  • Dmitry A Vidmanov + 1 more

The article studies machine-learning approaches to processing the user interface of mobile ecosystems, testing an adaptive-interface approach that formalizes adaptation as a stochastic sequential decision problem and uses multi-agent model-based reinforcement learning for adaptation planning. The article introduces, for the first time, the use of reinforcement learning in a mobile adaptive user interface. It presents adaptation options based on changing the representation of elements, as well as a transition function for the Markov decision process model. The article proposes a novel method called MARLMUI, a Multi-Agent Reinforcement Learning Mobile User Interface. Finally, Dec-POMDP, a decentralized partially observable MDP model, is considered as the proposed interface-processing algorithm based on multi-agent reinforcement learning. This study is a first attempt to systematize knowledge and practically implement an adaptive interface in the mobile ecosystem.

  • Research Article
  • Cited by 53
  • 10.1109/tccn.2019.2933420
Online Antenna Tuning in Heterogeneous Cellular Networks With Deep Reinforcement Learning
  • Dec 1, 2019
  • IEEE Transactions on Cognitive Communications and Networking
  • Eren Balevi + 1 more

We aim to jointly optimize the antenna tilt angle and the vertical and horizontal half-power beamwidths of the macrocells in a heterogeneous cellular network (HetNet). The interactions between the cells, most notably their coupled interference, render this optimization prohibitively complex. Utilizing a single-agent reinforcement learning (RL) algorithm for this optimization becomes quite suboptimal despite its scalability, whereas multi-agent RL algorithms yield better solutions at the expense of scalability. Hence, we propose a two-step compromise algorithm. Specifically, a multi-agent mean-field RL algorithm is first utilized in the offline phase to transfer information as features for the second (online) phase, in which a single-agent RL algorithm employs a deep neural network to learn user locations. This two-step approach is a practical solution for real deployments, which must automatically adapt to environmental changes in the network. Our results illustrate that the proposed algorithm approaches the performance of multi-agent RL, which requires millions of trials, with only hundreds of online trials, assuming relatively low environmental dynamics, and performs much better than single-agent RL. Furthermore, the proposed algorithm is compact and implementable, and empirically appears to provide a performance guarantee regardless of the amount of environmental dynamics.

  • Book Chapter
  • Cited by 213
  • 10.1007/978-3-642-27645-3_14
Game Theory and Multi-agent Reinforcement Learning
  • Jan 1, 2012
  • Ann Nowé + 2 more

Reinforcement learning was originally developed for Markov Decision Processes (MDPs). It allows a single agent to learn a policy that maximizes a possibly delayed reward signal in a stochastic stationary environment. It guarantees convergence to the optimal policy, provided that the agent can experiment sufficiently and the environment in which it operates is Markovian. However, when multiple agents apply reinforcement learning in a shared environment, this may fall outside the MDP model. In such systems, the optimal policy of an agent depends not only on the environment but also on the policies of the other agents. These situations arise naturally in a variety of domains, such as robotics, telecommunications, economics, distributed control, auctions, and traffic light control. In these domains, multi-agent learning is used either because of the complexity of the domain or because control is inherently decentralized. In such systems it is important that agents are capable of discovering good solutions to the problem at hand, either by coordinating with other learners or by competing with them. This chapter focuses on the application of reinforcement learning techniques in multi-agent systems. We describe a basic learning framework based on the economic research into game theory and illustrate the additional complexity that arises in such systems. We also describe a representative selection of algorithms for the different areas of multi-agent reinforcement learning research.
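The chapter's central point, that an agent's optimal policy depends on the other agents' policies, can be sketched with two independent Q-learners in a repeated 2x2 common-payoff coordination game (all numbers invented for illustration):

```python
import random

random.seed(1)
Q1, Q2 = [0.0, 0.0], [0.0, 0.0]   # stateless (bandit-style) value estimates
alpha, eps = 0.1, 0.2

def choose(Q):
    if random.random() < eps:
        return random.randrange(2)            # explore
    return max((0, 1), key=lambda a: Q[a])    # exploit

for _ in range(5000):
    a1, a2 = choose(Q1), choose(Q2)
    r = 1.0 if a1 == a2 else 0.0              # common payoff: both score iff they match
    Q1[a1] += alpha * (r - Q1[a1])
    Q2[a2] += alpha * (r - Q2[a2])

# Each learner's best action is only "best" relative to the other's policy;
# the pair settles on one of the two coordinated equilibria.
g1 = max((0, 1), key=lambda a: Q1[a])
g2 = max((0, 1), key=lambda a: Q2[a])
print(g1, g2)
```

Independent learners suffice here because the game is tiny and common-payoff; the chapter surveys the additional machinery needed in general-sum and larger settings.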

  • Research Article
  • Cited by 10
  • 10.1360/ssi-2020-0180
Review of the progress of communication-based multi-agent reinforcement learning
  • May 1, 2022
  • SCIENTIA SINICA Informationis
  • Han Wang + 2 more

Reinforcement learning (RL) technology has been successfully applied to various sequential decision-making environments over decades of development. Nowadays, RL is attracting increasing attention, even being touted as one of the approaches closest to general artificial intelligence. However, real-world problems often involve multiple intelligent agents interacting with each other, so we focus on multi-agent reinforcement learning (MARL) to handle such multi-agent systems in practice. In the past decade, the combination of multi-agent systems and RL has become increasingly close, gradually forming and enriching the research field of MARL. Reviewing the studies on MARL, we find that researchers mainly approach MARL problems from three perspectives: the learning framework, joint action learning, and communication-based MARL. In this paper, we focus on the communication perspective. We first state the reasons for choosing communication-based MARL and then survey prior studies that fall into the MARL category but differ in nature. We hope that this article can serve as a reference for developing MARL methods that solve practical problems for the benefit of society.

  • Research Article
  • Cited by 8
  • 10.1016/j.ijepes.2022.108848
A reactive power optimization partially observable Markov decision process with data uncertainty using multi-agent actor-attention-critic algorithm
  • Dec 5, 2022
  • International Journal of Electrical Power & Energy Systems
  • Yaru Gu + 1 more

  • Conference Article
  • Cited by 1
  • 10.1109/sice.2007.4421462
Multiple timescales PIA for cooperative reinforcement learning based on MDP model
  • Sep 1, 2007
  • Tomohiro Yamaguchi + 1 more

This paper describes a new method of dynamic programming (DP) based multi-agent reinforcement learning in the Markov decision process (MDP) model. It is difficult for agents to learn cooperative actions properly in multi-agent settings because the agents may change their policies at the same time. To solve this problem, each agent should perform its policy improvement at a different time. Therefore, we propose a multiple-timescales policy improvement method. We present comparative experiments between multiple-timescales policy improvement and exclusive policy improvement. As a result, our method reduces the search cost for the optimal common-payoff Nash solution.
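The benefit of improving policies at different times rather than simultaneously can be sketched with best-response dynamics in an invented common-payoff matrix game (this toy is not the paper's experiment):

```python
# Common payoff matrix R[a1][a2], invented for this illustration.
R = [[0, 3],
     [2, 0]]   # Nash pairs: (0, 1) with payoff 3 and (1, 0) with payoff 2

def br_row(a2):                 # greedy improvement for the row agent
    return max((0, 1), key=lambda a: R[a][a2])

def br_col(a1):                 # greedy improvement for the column agent
    return max((0, 1), key=lambda a: R[a1][a])

# Simultaneous improvement: both agents change policy at the same time.
a1 = a2 = 0
sim = []
for _ in range(4):
    a1, a2 = br_row(a2), br_col(a1)
    sim.append((a1, a2))        # oscillates (1,1) -> (0,0) -> (1,1) -> ...

# Alternating improvement: only one agent updates per round.
b1 = b2 = 0
for t in range(4):
    if t % 2 == 0:
        b1 = br_row(b2)
    else:
        b2 = br_col(b1)

print(sim, (b1, b2))            # alternation settles on the Nash pair (1, 0)
```

Simultaneous greedy improvement cycles between the two off-equilibrium joint actions, while giving each agent its own improvement timescale lets the pair lock onto a common-payoff Nash solution, which is the intuition behind the proposed method.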

  • Conference Article
  • 10.65109/ptzx6262
Reinforcement Learning in the Wild with Maximum Likelihood-based Model Transfer
  • May 6, 2024
  • Hannes Eriksson + 4 more

In this paper, we study the problem of transferring available Markov Decision Process (MDP) models to learn and plan efficiently in an unknown but similar MDP. We refer to this as the Model Transfer Reinforcement Learning (MTRL) problem. First, we formulate MTRL for discrete MDPs and Linear Quadratic Regulators (LQRs) with continuous states and actions. Then, we propose a generic two-stage algorithm, MLEMTRL, to address the MTRL problem in discrete and continuous settings. In the first stage, MLEMTRL uses a constrained Maximum Likelihood Estimation (MLE)-based approach to estimate the target MDP model using a set of known MDP models. In the second stage, using the estimated target MDP model, MLEMTRL deploys a model-based planning algorithm appropriate for the MDP class. Theoretically, we prove worst-case regret bounds for MLEMTRL in both realisable and non-realisable settings. We empirically demonstrate that MLEMTRL enables faster learning in new MDPs than learning from scratch and achieves near-optimal performance depending on the similarity of the available MDPs and the target MDP.
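The first stage of an MLEMTRL-style method can be sketched as follows; the models and data here are invented, and the sketch reduces constrained MLE over a model family to a simple likelihood comparison between two known transition models:

```python
import math
import random

random.seed(2)
# Two known 2-state, single-action transition models P[s][s'] (invented).
known = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.4, 0.6], [0.7, 0.3]],
]
target = known[0]      # the "unknown" MDP happens to match model 0

# Collect transition samples from the target MDP.
data, s = [], 0
for _ in range(500):
    s2 = 0 if random.random() < target[s][0] else 1
    data.append((s, s2))
    s = s2

def log_lik(model):
    # Log-likelihood of the observed transitions under a candidate model.
    return sum(math.log(model[s][s2]) for s, s2 in data)

best = max(range(len(known)), key=lambda i: log_lik(known[i]))
print(best)   # model 0 maximises the likelihood of the data
```

Once the best-fitting model is identified, the second stage would plan in it with any model-based algorithm suited to the MDP class, as the abstract describes.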

  • Research Article
  • Cited by 36
  • 10.1016/j.eswa.2022.117932
Reinforcement learning-based expanded personalized diabetes treatment recommendation using South Korean electronic health records
  • Jun 23, 2022
  • Expert Systems with Applications
  • Sang Ho Oh + 4 more

  • Research Article
  • Cited by 24
  • 10.3390/buildings12050641
AlphaTruss: Monte Carlo Tree Search for Optimal Truss Layout Design
  • May 11, 2022
  • Buildings
  • Ruifeng Luo + 3 more

Truss layout optimization under complex constraints, which aims to find the optimal node locations, connection topology between nodes, and cross-sectional areas of connecting bars, has been a challenging problem for decades. Monte Carlo Tree Search (MCTS) is a reinforcement learning search technique well suited to decision-making problems. Inspired by the success of AlphaGo using MCTS, the truss layout problem is formulated as a Markov Decision Process (MDP) model, and a two-stage MCTS-based algorithm, AlphaTruss, is proposed for generating optimal truss layouts considering topology, geometry, and bar size. In this MDP model, three sequential action sets of adding nodes, adding bars, and selecting sectional areas greatly expand the solution space, and the reward function gives feedback to actions according to both geometric stability and structural simulation. To find the optimal sequential actions, AlphaTruss solves the MDP model and gives the best decision at each design step by searching and learning through MCTS. Compared with existing results from the literature, AlphaTruss exhibits better performance in finding the truss layout with the minimum weight under stress, displacement, and buckling constraints, which verifies the validity and efficiency of the established algorithm.
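The MCTS loop that AlphaTruss builds on can be sketched on a toy sequential design problem (three binary choices with an invented reward, not the truss model):

```python
import math

TARGET = (1, 0, 1)          # invented "good design"; only this sequence earns reward 1

N, W = {}, {}               # visit counts and total returns per (state, action)

def ucb(state, a, c=1.4):
    if (state, a) not in N:
        return float("inf")              # unvisited actions are tried first
    parent = sum(N.get((state, b), 0) for b in (0, 1))
    return W[(state, a)] / N[(state, a)] + c * math.sqrt(math.log(parent) / N[(state, a)])

def simulate(state):
    if len(state) == 3:                  # terminal: evaluate the finished design
        return 1.0 if state == TARGET else 0.0
    a = max((0, 1), key=lambda b: ucb(state, b))   # selection by UCB
    r = simulate(state + (a,))
    key = (state, a)                     # backpropagate the return
    N[key] = N.get(key, 0) + 1
    W[key] = W.get(key, 0.0) + r
    return r

for _ in range(300):
    simulate(())

# Greedy decode of the learned tree by visit count.
plan, state = [], ()
for _ in range(3):
    a = max((0, 1), key=lambda b: N.get((state, b), 0))
    plan.append(a)
    state += (a,)
print(plan)
```

In AlphaTruss the same selection-simulation-backpropagation loop runs over the far larger action sets of adding nodes, adding bars, and choosing sectional areas, with the structural simulation supplying the terminal reward.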

  • Research Article
  • Cited by 1
  • 10.5555/1016416.1016419
A multi-agent system integrating reinforcement learning, bidding and genetic algorithms
  • Dec 1, 2003
  • Qidehu + 1 more

This paper presents a multi-agent reinforcement learning bidding approach (MARLBS) for performing multi-agent reinforcement learning. MARLBS integrates reinforcement learning, bidding and genetic a...

  • Single Report
  • 10.62311/nesx/rriv225
Optimal Control and Reinforcement Learning: Theory, Algorithms, and Robotics Applications
  • Mar 19, 2025
  • Murali Krishna Pasupuleti

Abstract: Optimal control and reinforcement learning (RL) are foundational techniques for intelligent decision-making in robotics, automation, and AI-driven control systems. This research explores the theoretical principles, computational algorithms, and real-world applications of optimal control and reinforcement learning, emphasizing their convergence for scalable and adaptive robotic automation. Key topics include dynamic programming, Hamilton-Jacobi-Bellman (HJB) equations, policy optimization, model-based RL, actor-critic methods, and deep RL architectures. The study also examines trajectory optimization, model predictive control (MPC), Lyapunov stability, and hierarchical RL for ensuring safe and robust control in complex environments. Through case studies in self-driving vehicles, autonomous drones, robotic manipulation, healthcare robotics, and multi-agent systems, this research highlights the trade-offs between model-based and model-free approaches, as well as the challenges of scalability, sample efficiency, hardware acceleration, and ethical AI deployment. The findings underscore the importance of hybrid RL-control frameworks, real-world RL training, and policy optimization techniques in advancing robotic intelligence and autonomous decision-making.

Keywords: Optimal control, reinforcement learning, model-based RL, model-free RL, dynamic programming, policy optimization, Hamilton-Jacobi-Bellman equations, actor-critic methods, deep reinforcement learning, trajectory optimization, model predictive control, Lyapunov stability, hierarchical RL, multi-agent RL, robotics, self-driving cars, autonomous drones, robotic manipulation, AI-driven automation, safety in RL, hardware acceleration, sample efficiency, hybrid RL-control frameworks, scalable AI.
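One of the listed topics, dynamic programming for optimal control, fits in a few lines. The following is a minimal sketch of a scalar discrete-time LQR problem solved by Riccati iteration, with assumed system and cost parameters:

```python
a, b = 1.1, 0.5    # assumed scalar dynamics x' = a*x + b*u (open-loop unstable)
q, r = 1.0, 1.0    # assumed state and control costs

# Value iteration on the discrete-time Riccati recursion.
P = 0.0
for _ in range(200):
    P = q + a * a * P - (a * b * P) ** 2 / (r + b * b * P)

K = a * b * P / (r + b * b * P)   # optimal state-feedback gain, u = -K*x

# Closed-loop rollout: the regulated state decays to zero.
x = 1.0
for _ in range(50):
    x = a * x + b * (-K * x)
print(round(P, 3), abs(x) < 1e-6)
```

The same recursion is the dynamic-programming backbone shared with value-based RL: iterating the Bellman (here Riccati) operator to a fixed point and reading off the greedy (here linear-feedback) policy.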

  • Research Article
  • Cited by 14
  • 10.3390/math10152699
A Multi-Depot Dynamic Vehicle Routing Problem with Stochastic Road Capacity: An MDP Model and Dynamic Policy for Post-Decision State Rollout Algorithm in Reinforcement Learning
  • Jul 30, 2022
  • Mathematics
  • Wadi Khalid Anuar + 3 more

In the event of a disaster, the road network is often compromised in terms of its capacity and usability conditions. This is a challenge for humanitarian operations in the context of delivering critical medical supplies. To optimise vehicle routing for such a problem, a Multi-Depot Dynamic Vehicle-Routing Problem with Stochastic Road Capacity (MDDVRPSRC) is formulated as a Markov Decision Process (MDP) model. An Approximate Dynamic Programming (ADP) solution method is adopted in which the Post-Decision State Rollout Algorithm (PDS-RA) is applied as the lookahead approach. To perform the rollout effectively, the PDS-RA is executed for all vehicles assigned to the problem; at the end, a decision is made by the agent. Five types of constructive base heuristics are proposed for the PDS-RA. First, the Teach Base Insertion Heuristic (TBIH-1) is proposed to study the partial random construction approach for the non-obvious decision. The heuristic is extended with TBIH-2 and TBIH-3 to show how the Sequential Insertion Heuristic (SIH) (I1) and the Clarke and Wright (CW) heuristic, respectively, can be executed in a dynamic setting as modifications of TBIH-1. Additionally, two further heuristics, TBIH-4 and TBIH-5 (TBIH-1 with the addition of Dynamic Lookahead SIH (DLASIH) and Dynamic Lookahead CW (DLACW), respectively), are proposed to improve the dynamically constructed decision rule (the dynamic policy on the go) in the lookahead simulations. The results obtained are compared with the matheuristic approach from previous work based on the PDS-RA.
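The rollout idea underlying the PDS-RA can be sketched on a toy routing instance (invented coordinates, no stochastic road capacities): each candidate next stop is priced by completing the route with a cheap base heuristic, and the agent commits to the cheapest.

```python
depot = (0, 0)
stops = [(1, 0), (3, 0), (-2, 0)]       # invented delivery points

def dist(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])   # Manhattan metric

def nn_cost(pos, remaining):
    # Base heuristic: greedily visit the nearest remaining stop.
    cost, remaining = 0.0, list(remaining)
    while remaining:
        nxt = min(remaining, key=lambda s: dist(pos, s))
        cost += dist(pos, nxt)
        pos = nxt
        remaining.remove(nxt)
    return cost

def rollout_route(pos, remaining):
    route = []
    while remaining:
        # One-step lookahead: price each candidate stop by completing the
        # route with the base heuristic, then commit to the cheapest.
        nxt = min(remaining, key=lambda s: dist(pos, s) +
                  nn_cost(s, [x for x in remaining if x != s]))
        route.append(nxt)
        pos = nxt
        remaining = [x for x in remaining if x != nxt]
    return route

route = rollout_route(depot, stops)
cost = dist(depot, route[0]) + sum(dist(a, b) for a, b in zip(route, route[1:]))
print(route, cost, nn_cost(depot, stops))   # rollout cost 7 beats the plain heuristic's 8
```

On this instance the plain nearest-neighbour heuristic is misled by the two nearby stops, while the rollout, using that same heuristic only inside its lookahead, recovers the cheaper route; the paper applies this mechanism per vehicle on post-decision states.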

  • Research Article
  • Cited by 73
  • 10.1016/j.neucom.2021.09.044
Multi-target tracking for unmanned aerial vehicle swarms using deep reinforcement learning
  • Sep 27, 2021
  • Neurocomputing
  • Wenhong Zhou + 4 more

  • Research Article
  • 10.3390/jmse14010055
A Reinforcement Learning Method for Automated Guided Vehicle Dispatching and Path Planning Considering Charging and Path Conflicts at an Automated Container Terminal
  • Dec 28, 2025
  • Journal of Marine Science and Engineering
  • Tianli Zuo + 5 more

The continued growth of international maritime trade has driven automated container terminals (ACTs) to pursue more efficient operational management strategies. In practice, a horizontal yard layout in ACTs significantly enhances transshipment efficiency. However, the more complex horizontal transport system calls for an effective approach to enhance automated guided vehicle (AGV) scheduling. Considering AGV charging and path conflicts, this paper proposes a multi-agent reinforcement learning (MARL) approach to address the AGV dispatching and path planning (VD2P) problem under a horizontal layout. The VD2P problem is formulated as a Markov decision process model. To mitigate the challenges of a high-dimensional state-action space, a multi-agent framework is developed to control AGV dispatching and path planning separately. A mixed global-individual reward mechanism is tailored to enhance both exploration and cooperation. A proximal policy optimization method is used to train the scheduling policies. Experiments indicate that the proposed MARL approach can provide high-quality solutions for a real-world-sized scenario within tens of seconds. Compared with benchmark methods, the proposed approach achieves an improvement of 8.4% to 53.8%. Moreover, sensitivity analyses are conducted to explore the impact of different AGV configurations and charging strategies on scheduling. Managerial insights are obtained to support more efficient terminal operations.

  • Book Chapter
  • Cited by 17
  • 10.1007/978-981-10-3433-6_56
Innovative Approach Towards Cooperation Models for Multi-agent Reinforcement Learning (CMMARL)
  • Jan 1, 2016
  • Deepak A Vidhate + 1 more

We propose an innovative approach towards Cooperation Models for Multi-agent Reinforcement Learning (CMMARL). A communication method for reinforcement learning based on a multi-agent scheme is proposed and implemented. Different cooperation methods for cooperative reinforcement learning, based on an expertness measure of each agent, are proposed: the group method, the dynamic method, the goal-oriented method, and the expert-agent method. Implementation results demonstrate that the suggested communication and cooperation methods can accelerate the convergence of the agents toward the best action strategies. The approach is developed for dynamic product availability across three retailer shops in a market. Retailers can cooperate with each other and benefit from shared information through their own policies, which accurately represent their goals and interests. The retailers are the learning agents in the problem and apply reinforcement learning to learn cooperatively from the situation. By making suitable assumptions about each dealer's inventory strategy, refill period, and customer arrival process, the problem becomes a Markov decision process model, which facilitates the application of learning algorithms.
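The expertness-based cooperation idea can be sketched as a weighted merge of the agents' Q-tables (all numbers invented for illustration):

```python
# Q-tables over 2 states x 2 actions for three hypothetical retailer agents.
q_tables = [
    [[0.9, 0.1], [0.2, 0.8]],   # experienced agent: sharp value estimates
    [[0.5, 0.5], [0.5, 0.5]],   # novice agent: uninformative estimates
    [[0.7, 0.2], [0.3, 0.6]],
]
expertness = [10.0, 1.0, 5.0]   # assumed expertness scores (e.g. cumulative reward)

def merge(q_tables, expertness):
    # Weighted average of Q-tables, weights proportional to expertness.
    w = [e / sum(expertness) for e in expertness]
    n_s, n_a = len(q_tables[0]), len(q_tables[0][0])
    return [[sum(w[i] * q_tables[i][s][a] for i in range(len(q_tables)))
             for a in range(n_a)] for s in range(n_s)]

merged = merge(q_tables, expertness)
# The experienced agents dominate the merge: greedy actions follow their estimates.
print([max((0, 1), key=lambda a: merged[s][a]) for s in range(2)])   # -> [0, 1]
```

The group, dynamic, goal-oriented, and expert-agent methods named in the abstract can be read as different choices of which agents share and how the weights are set; the weighted merge above is only the simplest such scheme.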
