A Review of Reinforcement Learning in Financial Applications
In recent years, there has been a growing trend of applying reinforcement learning (RL) in financial applications. This approach has shown great potential for decision-making tasks in finance. In this review, we present a comprehensive study of the applications of RL in finance and conduct a series of meta-analyses to investigate the common themes in the literature, such as the factors that most significantly affect RL's performance compared with traditional methods. Moreover, we identify challenges, including explainability, Markov decision process modeling, and robustness, that hinder the broader utilization of RL in the financial industry and discuss recent advancements in overcoming these challenges. Finally, we propose future research directions, such as benchmarking, contextual RL, multi-agent RL, and model-based RL, to address these challenges and to further enhance the implementation of RL in finance.
Keywords: Reinforcement Learning; Markov Decision Process Modeling; Financial Applications; Reinforcement Learning's Performance; Multi-agent Reinforcement Learning; Financial Industry; Implementation of Reinforcement Learning; Applications of Reinforcement Learning; Model-based Reinforcement Learning; Decision Process Modeling
- Conference Article · 7 citations · DOI: 10.1109/reepe57272.2023.10086785 · Mar 16, 2023
The article studies machine-learning approaches to processing the user interface of mobile ecosystems in order to test adaptive interfaces, formalizing adaptation as a stochastic sequential decision problem and using multi-agent model-based reinforcement learning for adaptation planning. The article introduces, for the first time, the use of reinforcement learning in a mobile adaptive user interface. It presents adaptation options based on changing the representation of interface elements, as well as a transition function for the Markov decision process model, and proposes a novel method called MARLMUI (Multi-Agent Reinforcement Learning Mobile User Interface). In conclusion, Dec-POMDP, a decentralized partially observable MDP model, is considered as the proposed interface-processing algorithm based on multi-agent reinforcement learning. This study is the first attempt to systematize knowledge and practically implement an adaptive interface in the mobile ecosystem.
- Research Article · 53 citations · DOI: 10.1109/tccn.2019.2933420 · Dec 1, 2019 · IEEE Transactions on Cognitive Communications and Networking
We aim to jointly optimize the antenna tilt angle and the vertical and horizontal half-power beamwidths of the macrocells in a heterogeneous cellular network (HetNet). The interactions between the cells, most notably their coupled interference, render this optimization prohibitively complex. A single-agent reinforcement learning (RL) algorithm is quite suboptimal for this task despite its scalability, whereas multi-agent RL algorithms yield better solutions at the expense of scalability. Hence, we propose a two-step compromise algorithm. Specifically, a multi-agent mean-field RL algorithm is first utilized in an offline phase to transfer information as features to the second (online) phase, a single-agent RL algorithm that employs a deep neural network to learn user locations. This two-step approach is a practical solution for real deployments, which should automatically adapt to environmental changes in the network. Our results illustrate that the proposed algorithm approaches the performance of multi-agent RL, which requires millions of trials, with only hundreds of online trials, assuming relatively low environmental dynamics, and performs much better than single-agent RL. Furthermore, the proposed algorithm is compact and implementable, and empirically appears to provide a performance guarantee regardless of the amount of environmental dynamics.
- Book Chapter · 213 citations · DOI: 10.1007/978-3-642-27645-3_14 · Jan 1, 2012
Reinforcement learning was originally developed for Markov Decision Processes (MDPs). It allows a single agent to learn a policy that maximizes a possibly delayed reward signal in a stochastic stationary environment, and it guarantees convergence to the optimal policy provided that the agent can experiment sufficiently and the environment in which it is operating is Markovian. However, when multiple agents apply reinforcement learning in a shared environment, the setting may fall outside the MDP model: the optimal policy of an agent then depends not only on the environment but also on the policies of the other agents. Such situations arise naturally in a variety of domains, such as robotics, telecommunications, economics, distributed control, auctions, and traffic light control. In these domains multi-agent learning is used, either because of the complexity of the domain or because control is inherently decentralized. In such systems it is important that agents are capable of discovering good solutions to the problem at hand, either by coordinating with other learners or by competing with them. This chapter focuses on the application of reinforcement learning techniques in multi-agent systems. We describe a basic learning framework based on economic research into game theory, illustrate the additional complexity that arises in such systems, and describe a representative selection of algorithms from the different areas of multi-agent reinforcement learning research.
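As a point of reference for the single-agent setting this chapter starts from, the MDP learning loop can be sketched with tabular Q-learning on a toy chain MDP. The chain layout, uniformly random behavior policy, and hyperparameters below are illustrative assumptions, not taken from the chapter:

```python
import random

def q_learning_chain(n_states=5, episodes=2000, alpha=0.1, gamma=0.9, seed=0):
    """Tabular Q-learning on a deterministic chain MDP.

    Actions: 0 = left, 1 = right; reward 1 for reaching the rightmost state.
    The behavior policy is uniformly random: Q-learning is off-policy, so the
    greedy policy can still be recovered from the learned Q-table.
    """
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        for _ in range(50):  # episode step limit
            a = rng.randrange(2)  # random exploration
            s2 = max(0, s - 1) if a == 0 else s + 1
            r = 1.0 if s2 == n_states - 1 else 0.0
            # Standard Q-learning update toward r + gamma * max_a' Q(s', a')
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
            if r > 0:
                break  # goal reached: episode ends
            s = s2
    return Q

Q = q_learning_chain()
# Greedy policy: in every non-terminal state, "right" should dominate.
policy = [Q[s].index(max(Q[s])) for s in range(4)]
```

With a discount of 0.9, the learned values approach 0.9^(3-s) for the "right" action, so the greedy policy moves right everywhere.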
- Research Article · 10 citations · DOI: 10.1360/ssi-2020-0180 · May 1, 2022 · SCIENTIA SINICA Informationis
Reinforcement learning (RL) technology has been successfully applied to various continuous decision environments over decades of development. Nowadays, RL is attracting ever more attention, even being touted as one of the approaches closest to general artificial intelligence. However, real-world problems often involve multiple intelligent agents interacting with each other, so we focus on multi-agent reinforcement learning (MARL) to deal with such multi-agent systems in practice. In the past decade, the combination of multi-agent systems and RL has become increasingly close, gradually forming and enriching the research field of MARL. Reviewing the studies on MARL, we found that researchers mainly approach MARL problems from three perspectives: the learning framework, joint action learning, and communication-based MARL. In this paper, we focus on the communication perspective. We first state the reasons for choosing communication-based MARL and then survey the relevant studies, which fall into the MARL category but differ in nature. We hope that this article can provide a reference for developing MARL methods that solve practical problems.
- Research Article · 8 citations · DOI: 10.1016/j.ijepes.2022.108848 · Dec 5, 2022 · International Journal of Electrical Power & Energy Systems
A reactive power optimization partially observable Markov decision process with data uncertainty using multi-agent actor-attention-critic algorithm
- Conference Article · 1 citation · DOI: 10.1109/sice.2007.4421462 · Sep 1, 2007
This paper describes a new method of dynamic programming (DP)-based multiagent reinforcement learning in the Markov decision process (MDP) model. It is difficult for agents to learn cooperative actions properly in a multiagent setting because the agents may all change their policies at the same time. To solve this problem, each agent should perform its policy improvement at a different time, and we therefore propose a multiple-timescales policy improvement method. We present comparative experiments between multiple-timescales policy improvement and exclusive policy improvement. As a result, our method reduced the search cost for the optimal common-payoff Nash solution.
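The core idea, letting agents improve their policies at different times rather than simultaneously, can be illustrated on a tiny common-payoff matrix game. The payoff matrix, starting actions, and update schedule below are illustrative assumptions, not the paper's experimental setup:

```python
# Common-payoff (team) game: both agents receive the same reward.
# The agents must coordinate on (0, 0) or (1, 1) to earn 10.
PAYOFF = [[10, 0],
          [0, 10]]

def best_response(payoff, other_action, agent):
    """Greedy policy improvement for one agent, the other's policy held fixed."""
    if agent == 0:
        return max((0, 1), key=lambda a: payoff[a][other_action])
    return max((0, 1), key=lambda b: payoff[other_action][b])

def alternating_improvement(a0=0, a1=1, rounds=6):
    """Agents take turns improving (different timescales): converges."""
    for t in range(rounds):
        if t % 2 == 0:
            a0 = best_response(PAYOFF, a1, 0)
        else:
            a1 = best_response(PAYOFF, a0, 1)
    return a0, a1

def simultaneous_improvement(a0=0, a1=1, rounds=6):
    """Both agents improve at the same time: can oscillate forever."""
    for _ in range(rounds):
        a0, a1 = best_response(PAYOFF, a1, 0), best_response(PAYOFF, a0, 1)
    return a0, a1
```

Starting from the miscoordinated profile (0, 1), alternating improvement settles on a coordinated Nash profile with payoff 10, while simultaneous improvement cycles between (0, 1) and (1, 0) and ends with payoff 0, which is the failure mode the paper's scheduling addresses.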
- Conference Article · DOI: 10.65109/ptzx6262 · May 6, 2024
In this paper, we study the problem of transferring available Markov Decision Process (MDP) models to learn and plan efficiently in an unknown but similar MDP; we refer to this as the Model Transfer Reinforcement Learning (MTRL) problem. First, we formulate MTRL for discrete MDPs and Linear Quadratic Regulators (LQRs) with continuous states and actions. Then, we propose a generic two-stage algorithm, MLEMTRL, to address the MTRL problem in discrete and continuous settings. In the first stage, MLEMTRL uses a constrained Maximum Likelihood Estimation (MLE)-based approach to estimate the target MDP model using a set of known MDP models. In the second stage, using the estimated target MDP model, MLEMTRL deploys a model-based planning algorithm appropriate for the MDP class. Theoretically, we prove worst-case regret bounds for MLEMTRL in both realisable and non-realisable settings. We empirically demonstrate that MLEMTRL allows faster learning in new MDPs than learning from scratch and achieves near-optimal performance depending on the similarity of the available MDPs to the target MDP.
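The two-stage structure can be sketched for a deliberately minimal case: a 2-state, single-action target MDP whose transition model is estimated as a convex combination of two known models by grid-search MLE, followed by planning (which, with a single action, reduces to value iteration as policy evaluation). All models, data, and the grid search below are illustrative assumptions, not the paper's MLEMTRL implementation:

```python
import math

# Two known transition models for a 2-state, single-action MDP
# (row s gives P(next_state | s)). Purely illustrative numbers.
M1 = [[0.9, 0.1], [0.1, 0.9]]
M2 = [[0.2, 0.8], [0.8, 0.2]]

def mixture(w):
    """Convex combination w*M1 + (1-w)*M2 over the set of known models."""
    return [[w * M1[s][t] + (1 - w) * M2[s][t] for t in range(2)] for s in range(2)]

def fit_weight(data, grid=101):
    """Stage 1: constrained MLE over the mixture weight, via grid search."""
    def log_lik(w):
        m = mixture(w)
        return sum(math.log(m[s][t]) for s, t in data)
    return max((i / (grid - 1) for i in range(grid)), key=log_lik)

def value_iteration(P, r, gamma=0.9, iters=200):
    """Stage 2: plan in the estimated model (trivial here, as there is one action)."""
    V = [0.0, 0.0]
    for _ in range(iters):
        V = [r[s] + gamma * sum(P[s][t] * V[t] for t in range(2)) for s in range(2)]
    return V

# Transitions observed in the target MDP, generated to resemble M1:
# mostly self-transitions, occasional switches.
data = [(0, 0)] * 9 + [(0, 1)] + [(1, 1)] * 9 + [(1, 0)]
w = fit_weight(data)                      # close to 1: the target looks like M1
V = value_iteration(mixture(w), r=[1.0, 0.0])
```

Because the observed transitions match M1's self-transition pattern, the likelihood is maximized at the M1 end of the simplex, and planning then proceeds in that estimated model instead of learning it from scratch.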
- Research Article · 36 citations · DOI: 10.1016/j.eswa.2022.117932 · Jun 23, 2022 · Expert Systems with Applications
Reinforcement learning-based expanded personalized diabetes treatment recommendation using South Korean electronic health records
- Research Article · 24 citations · DOI: 10.3390/buildings12050641 · May 11, 2022 · Buildings
Truss layout optimization under complex constraints, which seeks the optimal node locations, the connection topology between nodes, and the cross-sectional areas of connecting bars, has been a challenging problem for decades. Monte Carlo Tree Search (MCTS) is a reinforcement learning search technique well suited to such decision-making problems. Inspired by the success of AlphaGo using MCTS, the truss layout problem is formulated as a Markov Decision Process (MDP) model, and a 2-stage MCTS-based algorithm, AlphaTruss, is proposed for generating the optimal truss layout considering topology, geometry, and bar size. In this MDP model, three sequential action sets of adding nodes, adding bars, and selecting sectional areas greatly expand the solution space, and the reward function gives feedback on actions according to both geometric stability and structural simulation. To find the optimal sequential actions, AlphaTruss solves the MDP model and gives the best decision at each design step by searching and learning through MCTS. Compared with existing results from the literature, AlphaTruss exhibits better performance in finding the truss layout with the minimum weight under stress, displacement, and buckling constraints, which verifies the validity and efficiency of the established algorithm.
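The "formulate sequential design as an MDP and search it with MCTS" pattern can be sketched on a stand-in problem, since a real truss simulator is out of scope here: choose three binary design decisions to match a hidden target, with the match fraction playing the role of the structural reward. The problem, target, and constants below are illustrative assumptions, not AlphaTruss itself:

```python
import math
import random

TARGET = (1, 0, 1)  # stand-in "optimal design": three binary decisions

def reward(actions):
    """Fraction of decisions matching the target (stands in for the
    geometric-stability / structural-simulation reward)."""
    return sum(a == t for a, t in zip(actions, TARGET)) / len(TARGET)

class Node:
    def __init__(self, actions=()):
        self.actions = actions   # decisions taken so far
        self.children = {}       # action -> Node
        self.visits = 0
        self.value = 0.0         # running mean of rollout rewards

def ucb_child(node, c=1.4):
    """UCB1 selection: exploit high-value children, explore rarely tried ones."""
    return max(node.children.values(),
               key=lambda ch: ch.value + c * math.sqrt(math.log(node.visits) / ch.visits))

def mcts(iterations=2000, seed=0):
    rng = random.Random(seed)
    root = Node()
    for _ in range(iterations):
        node, path = root, [root]
        # 1. selection: descend while fully expanded and non-terminal
        while len(node.actions) < len(TARGET) and len(node.children) == 2:
            node = ucb_child(node)
            path.append(node)
        # 2. expansion: add one untried child
        if len(node.actions) < len(TARGET):
            a = rng.choice([a for a in (0, 1) if a not in node.children])
            node.children[a] = Node(node.actions + (a,))
            node = node.children[a]
            path.append(node)
        # 3. simulation: finish the design with random decisions
        rollout = list(node.actions)
        while len(rollout) < len(TARGET):
            rollout.append(rng.randrange(2))
        r = reward(rollout)
        # 4. backpropagation: update mean value along the visited path
        for n in path:
            n.visits += 1
            n.value += (r - n.value) / n.visits
    plan, node = [], root  # extract the most-visited action sequence
    while node.children:
        a, node = max(node.children.items(), key=lambda kv: kv[1].visits)
        plan.append(a)
    return plan

plan = mcts()
```

The four phases (selection, expansion, simulation, backpropagation) are the same loop AlphaTruss runs over node-adding, bar-adding, and section-selection actions; only the simulator and reward differ.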
- Research Article · 1 citation · DOI: 10.5555/1016416.1016419 · Dec 1, 2003
This paper presents a multi-agent reinforcement learning bidding approach (MARLBS) for performing multi-agent reinforcement learning. MARLBS integrates reinforcement learning, bidding and genetic a...
- Single Report · DOI: 10.62311/nesx/rriv225 · Mar 19, 2025
Abstract: Optimal control and reinforcement learning (RL) are foundational techniques for intelligent decision-making in robotics, automation, and AI-driven control systems. This research explores the theoretical principles, computational algorithms, and real-world applications of optimal control and reinforcement learning, emphasizing their convergence for scalable and adaptive robotic automation. Key topics include dynamic programming, Hamilton-Jacobi-Bellman (HJB) equations, policy optimization, model-based RL, actor-critic methods, and deep RL architectures. The study also examines trajectory optimization, model predictive control (MPC), Lyapunov stability, and hierarchical RL for ensuring safe and robust control in complex environments. Through case studies in self-driving vehicles, autonomous drones, robotic manipulation, healthcare robotics, and multi-agent systems, this research highlights the trade-offs between model-based and model-free approaches, as well as the challenges of scalability, sample efficiency, hardware acceleration, and ethical AI deployment. The findings underscore the importance of hybrid RL-control frameworks, real-world RL training, and policy optimization techniques in advancing robotic intelligence and autonomous decision-making.
Keywords: Optimal control, reinforcement learning, model-based RL, model-free RL, dynamic programming, policy optimization, Hamilton-Jacobi-Bellman equations, actor-critic methods, deep reinforcement learning, trajectory optimization, model predictive control, Lyapunov stability, hierarchical RL, multi-agent RL, robotics, self-driving cars, autonomous drones, robotic manipulation, AI-driven automation, safety in RL, hardware acceleration, sample efficiency, hybrid RL-control frameworks, scalable AI.
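As one concrete instance of the dynamic-programming machinery surveyed above, the discrete-time LQR gain for a scalar linear system can be obtained by iterating the Riccati recursion to its fixed point. The system below, with an unstable open-loop pole a = 1.2, is an illustrative assumption:

```python
def lqr_gain(a, b, q, r, iters=200):
    """Discrete-time scalar LQR: minimize sum of q*x^2 + r*u^2 for x' = a*x + b*u.

    Iterates the Riccati recursion P <- q + a^2*P - (a*b*P)^2 / (r + b^2*P)
    to its fixed point, then returns the optimal feedback gain k in u = -k*x.
    """
    P = q
    for _ in range(iters):
        P = q + a * a * P - (a * b * P) ** 2 / (r + b * b * P)
    return a * b * P / (r + b * b * P)

k = lqr_gain(a=1.2, b=1.0, q=1.0, r=1.0)
# Closed loop x' = (a - b*k) * x must be stable: |a - b*k| < 1.
```

Here the fixed point satisfies P^2 - 1.44*P - 1 = 0, giving P close to 1.952 and k close to 0.794, so the unstable plant is stabilized by the optimal feedback.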
- Research Article · 14 citations · DOI: 10.3390/math10152699 · Jul 30, 2022 · Mathematics
In the event of a disaster, the road network is often compromised in terms of its capacity and usability conditions. This is a challenge for humanitarian operations in the context of delivering critical medical supplies. To optimise vehicle routing for such a problem, a Multi-Depot Dynamic Vehicle-Routing Problem with Stochastic Road Capacity (MDDVRPSRC) is formulated as a Markov Decision Process (MDP) model. An Approximate Dynamic Programming (ADP) solution method is adopted in which the Post-Decision State Rollout Algorithm (PDS-RA) is applied as the lookahead approach. To perform the rollout effectively, the PDS-RA is executed for all vehicles assigned to the problem, and a decision is then made by the agent. Five types of constructive base heuristics are proposed for the PDS-RA. First, the Teach Base Insertion Heuristic (TBIH-1) is proposed to study the partial random construction approach for the non-obvious decision. This heuristic is extended by TBIH-2 and TBIH-3, which show how the Sequential Insertion Heuristic (SIH) (I1) and the Clarke and Wright (CW) heuristic, respectively, can be executed in a dynamic setting as modifications of TBIH-1. Additionally, two further heuristics, TBIH-4 and TBIH-5 (TBIH-1 with the addition of Dynamic Lookahead SIH (DLASIH) and Dynamic Lookahead CW (DLACW), respectively), are proposed to improve the decision rule constructed on the go (the dynamic policy) in the lookahead simulations. The results obtained are compared with the matheuristic approach from previous work based on the PDS-RA.
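The lookahead pattern the paper builds on, completing each candidate decision with a base heuristic and committing to the cheapest completion, can be sketched on a one-vehicle deterministic toy: customers on a line, with nearest-neighbor as the base heuristic. All of this is an illustrative assumption, far simpler than the stochastic multi-depot MDDVRPSRC setting:

```python
def route_cost(start, stops):
    """Total travel distance visiting stops in order from start (1-D positions)."""
    cost, pos = 0, start
    for s in stops:
        cost += abs(s - pos)
        pos = s
    return cost

def nearest_neighbor(pos, remaining):
    """Base heuristic: always go to the closest remaining customer."""
    order, remaining = [], sorted(remaining)
    while remaining:
        nxt = min(remaining, key=lambda c: abs(c - pos))
        order.append(nxt)
        remaining.remove(nxt)
        pos = nxt
    return order

def rollout_policy(start, customers):
    """One-step lookahead rollout: try each next customer, complete the route
    with the base heuristic, and commit to the cheapest completion."""
    pos, remaining, route = start, list(customers), []
    while remaining:
        def completion_cost(c):
            rest = [x for x in remaining if x != c]
            return abs(c - pos) + route_cost(c, nearest_neighbor(c, rest))
        nxt = min(remaining, key=completion_cost)
        route.append(nxt)
        remaining.remove(nxt)
        pos = nxt
    return route

customers = [5, -3, 1, 8]
base_cost = route_cost(0, nearest_neighbor(0, customers))
rollout_cost = route_cost(0, rollout_policy(0, customers))
```

On this instance the base heuristic pays 16 while the rollout policy pays 14; rollout never does worse than the base heuristic it wraps, which is the standard improvement property this family of ADP methods relies on.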
- Research Article · 73 citations · DOI: 10.1016/j.neucom.2021.09.044 · Sep 27, 2021 · Neurocomputing
Multi-target tracking for unmanned aerial vehicle swarms using deep reinforcement learning
- Research Article · DOI: 10.3390/jmse14010055 · Dec 28, 2025 · Journal of Marine Science and Engineering
The continued growth of international maritime trade has driven automated container terminals (ACTs) to pursue more efficient operational management strategies. In practice, the horizontal yard layout in ACTs significantly enhances transshipment efficiency. However, the more complex horizontal transporting system calls for an effective approach to enhance automated guided vehicle (AGV) scheduling. Considering AGV charging and path conflicts, this paper proposes a multi-agent reinforcement learning (MARL) approach to address the AGV dispatching and path planning (VD2P) problem under a horizontal layout. The VD2P problem is formulated as a Markov decision process model. To mitigate the challenges of a high-dimensional state-action space, a multi-agent framework is developed to control AGV dispatching and path planning separately. A mixed global-individual reward mechanism is tailored to enhance both exploration and cooperation. A proximal policy optimization method is used to train the scheduling policies. Experiments indicate that the proposed MARL approach can provide high-quality solutions for a real-world-sized scenario within tens of seconds. Compared with benchmark methods, the proposed approach achieves an improvement of 8.4% to 53.8%. Moreover, sensitivity analyses are conducted to explore the impact of different AGV configurations and charging strategies on scheduling. Managerial insights are obtained to support more efficient terminal operations.
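The mixed global-individual reward idea can be sketched in a few lines. The convex-combination form and the alpha parameter below are assumptions for illustration, not necessarily the paper's exact mechanism:

```python
def mixed_reward(individual, team, alpha=0.5):
    """Blend a shared team reward with each agent's individual reward.

    alpha = 1.0 -> fully global (pure cooperation signal);
    alpha = 0.0 -> fully selfish (pure individual signal).
    Returns one blended reward per agent.
    """
    return [alpha * team + (1 - alpha) * r_i for r_i in individual]

# Two AGVs: one earned individual reward 1.0, the other 0.0; the team earned 2.0.
blended = mixed_reward([1.0, 0.0], team=2.0, alpha=0.5)  # → [1.5, 1.0]
```

Even the agent with zero individual reward receives a positive signal from the team term, which is what encourages cooperation without removing the incentive to perform well individually.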
- Book Chapter · 17 citations · DOI: 10.1007/978-981-10-3433-6_56 · Jan 1, 2016
We propose an innovative approach to Cooperation Models for Multi-agent Reinforcement Learning (CMMARL). Communication methods for reinforcement learning that depend on a multiagent scheme are proposed and implemented. Different cooperation methods for cooperative reinforcement learning, based on an expertness measure of each agent, are proposed here: the group method, the dynamic method, the goal-oriented method, and the expert-agent method. Implementation results demonstrate that the suggested communication and cooperation methods are able to accelerate the convergence of the agents toward the best action strategies. The approach is developed for dynamic product availability across three retailer shops in a market. Retailers can cooperate with each other and benefit from cooperative information through their own policies, which accurately represent their goals and interests. The retailers are the learning agents in the problem and apply reinforcement learning to learn cooperatively from the situation. By making suitable assumptions about each retailer's inventory strategy, replenishment period, and customer arrival process, the problem becomes a Markov decision process model, which makes it possible to apply learning algorithms.
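The expertness-based cooperation idea, weighting each agent's experience by how expert it is, can be sketched as a weighted merge of tabular Q-values. The normalized-expertness average below is an assumed illustration; the chapter's group, dynamic, goal-oriented, and expert-agent variants differ in how such weights are chosen:

```python
def share_q_tables(q_tables, expertness):
    """Merge several agents' Q-tables into one, weighting each table by its
    agent's (positive) expertness score, normalized to sum to 1."""
    total = sum(expertness)
    weights = [e / total for e in expertness]
    n_states = len(q_tables[0])
    n_actions = len(q_tables[0][0])
    return [[sum(w * q[s][a] for w, q in zip(weights, q_tables))
             for a in range(n_actions)]
            for s in range(n_states)]

# Two retailers with opposite preferences over one state's two actions;
# the first is three times as expert, so its values dominate the merge.
merged = share_q_tables([[[1.0, 0.0]], [[0.0, 1.0]]], expertness=[3.0, 1.0])
# → [[0.75, 0.25]]
```

Each agent can then continue learning from the merged table, so a less experienced retailer inherits most of its behavior from the more expert one.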