Introduction

The inventory management of platelets is complicated by their short shelf-life. In many hospitals, ordering policies are set by staff based on their experience rather than through forecasting or mathematical modelling. A data-driven approach may help to reduce wastage while ensuring the right unit is available for the right patient at the right time. Finding optimal policies for managing perishable inventory is known to be computationally challenging due to the large number of “observation states” required to represent the age profile of the stock. Reinforcement learning (RL) is a subfield of machine learning in which agents learn how to solve a sequential decision-making task through interaction with an environment. Deep reinforcement learning (DRL) uses deep neural networks to efficiently learn a policy (a mapping between an observed state and an action) for problems with many observation states. We demonstrate, with both simulated and real-life demand data, that DRL can be used to learn effective platelet replenishment policies for a hospital blood bank.

Methods

We implemented the platelet replenishment scenario from a recent study (Rajendran & Srinivas, 2020) as an RL environment using the OpenAI gym Python package. Daily demand is stochastic, sampled from day-of-the-week specific Poisson distributions. The reward is the negative cost incurred, comprising fixed and variable ordering costs, holding costs, shortage costs and wastage costs (an illustrative sketch of this environment follows the Conclusion). We reimplemented the four heuristic replenishment policies described in that study, with policy parameters fit using stochastic mixed integer linear programming (SMILP); in these policies the ordering decision is based on the total number of units in stock, and the order quantity is either fixed or the difference between the current stock and a target stock level. We used the RL environment to train DRL policies with two popular methods: Deep Q-Networks (DQN) and Proximal Policy Optimization (PPO). We compared the performance of these six policies, together with the optimal policy found using value iteration (VI) and a policy with perfect foresight, on 1,000 randomly generated evaluation episodes, each 365 days long. We repeated the analysis using real demand data obtained from the blood transfusion laboratory at University College London Hospital, a large tertiary care hospital in the United Kingdom, fitting the policies using daily demand data from 2015 and 2016 and evaluating their performance on data from 2017.

Results & Discussion

The DRL policies incurred consistently lower mean daily costs on the simulated evaluation episodes than the four policies fit using SMILP. PPO incurred a lower mean daily cost than DQN in 96% of the evaluation episodes and performed near-optimally: its mean daily cost was only 0.3% higher than that of the VI policy and 8% higher than that of the policy with perfect foresight. The best performing heuristic policy fit using SMILP, (s, S), incurred a mean daily cost 1.2% higher than that of the VI policy. The VI policy could itself be represented as an (s, S) heuristic policy, with different parameters to those found using SMILP. Therefore, in this case, the advantage of DRL over SMILP appears to be that it can efficiently learn from many more example sequences of demand, rather than its ability to represent more complex functions. Of the six SMILP and DRL policies, PPO incurred the lowest mean daily cost on the real demand data from 2017, 10% higher than that of a policy with perfect foresight.
Holding costs were the main difference between PPO and the policy with perfect foresight when using both simulated and real demand data. In both experiments PPO achieved mean wastage of 0%; it suffered a shortage, which would have required placing an additional rush order, on 3.2% of days with simulated demand data and 2.7% of days with real demand data.

Conclusion

DRL can be used to learn near-optimal policies for a simplified platelet replenishment task, consistently outperforming a previously reported approach. This suggests it may be a viable method for finding policies that can be applied in practice to improve the management of platelet inventory and reduce wastage. In future work, the ability of DRL to learn how to act in problems with large observation spaces will enable consideration of additional aspects of the real problem for which existing methods become computationally infeasible or impractical, such as the fact that not all units arrive fresh and not all requested units are transfused.
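To make the replenishment scenario concrete, the following is a minimal plain-Python sketch of the transition and cost logic described in Methods. The shelf-life, cost parameters, demand means, same-day delivery of orders and (s, S) thresholds shown here are illustrative assumptions only, not the values used in the study, which wraps equivalent logic in an OpenAI gym environment.

```python
import numpy as np

# All values below are illustrative assumptions for this sketch; the study's
# parameters follow Rajendran & Srinivas (2020).
SHELF_LIFE = 3                              # assumed shelf-life (days) of delivered units
POISSON_MEANS = [4.4, 5.1, 4.8, 5.0,        # assumed day-of-the-week demand means
                 5.3, 3.2, 2.9]             # (Monday through Sunday)
FIXED_ORDER_COST = 10.0                     # assumed cost of placing any order
VARIABLE_ORDER_COST = 1.0                   # assumed cost per unit ordered
HOLDING_COST = 1.0                          # assumed cost per unit held overnight
SHORTAGE_COST = 20.0                        # assumed cost per unit of unmet demand
WASTAGE_COST = 5.0                          # assumed cost per unit expired


def step(stock, order_qty, weekday, rng):
    """Advance the simulated blood bank by one day.

    `stock[i]` is the number of units that expire at the end of day i from now
    (`stock[0]` expires tonight). Returns the next age profile and the reward,
    which is the negative of the cost incurred.
    """
    stock = stock.copy()
    stock[-1] += order_qty  # ordered units assumed to arrive fresh, same day

    # Demand is sampled from a day-of-the-week specific Poisson distribution
    # and met oldest-unit-first.
    demand = rng.poisson(POISSON_MEANS[weekday])
    remaining = demand
    for age in range(SHELF_LIFE):
        issued = min(stock[age], remaining)
        stock[age] -= issued
        remaining -= issued
    shortage = remaining

    # Unissued units at the end of their shelf-life are wasted; the rest age by a day.
    wastage = stock[0]
    stock = np.append(stock[1:], 0)

    cost = (FIXED_ORDER_COST * (order_qty > 0)
            + VARIABLE_ORDER_COST * order_qty
            + HOLDING_COST * stock.sum()
            + SHORTAGE_COST * shortage
            + WASTAGE_COST * wastage)
    return stock, -cost


def s_S_policy(stock, s=5, S=12):
    """(s, S) heuristic: if total stock falls below s, order up to S units."""
    total = int(stock.sum())
    return S - total if total < s else 0


# Roll out the heuristic for one 365-day evaluation episode.
rng = np.random.default_rng(0)
stock = np.zeros(SHELF_LIFE, dtype=int)
total_reward = 0.0
for day in range(365):
    order = s_S_policy(stock)
    stock, reward = step(stock, order, day % 7, rng)
    total_reward += reward
print(f"Mean daily cost: {-total_reward / 365:.2f}")
```

Under an interface of this form, the DRL agents observe the full age profile `stock`, whereas the SMILP-fit heuristics act only on its sum.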