Introduction

The inventory management of platelets is complicated by their short shelf-life. In many hospitals, ordering policies are set by staff based on their experience rather than through forecasting or mathematical modelling. A data-driven approach may help to reduce wastage while ensuring the right unit is available for the right patient at the right time. Finding optimal policies for managing perishable inventory is known to be computationally challenging due to the large number of “observation states” required to represent the age profile of the stock. Reinforcement learning (RL) is a subfield of machine learning in which agents learn how to solve a sequential decision-making task through interaction with an environment. Deep reinforcement learning (DRL) uses deep neural networks to efficiently learn a policy (a mapping between an observed state and an action) for problems with many observation states. We demonstrate, with both simulated and real-life demand data, that DRL can be used to learn effective platelet replenishment policies for a hospital blood bank.

Methods

We implemented the platelet replenishment scenario from a recent study (Rajendran & Srinivas, 2020) as an RL environment using the OpenAI gym Python package. Daily demand is stochastic, sampled from day-of-the-week specific Poisson distributions. The reward is the negative cost incurred, comprising fixed and variable ordering costs, holding costs, shortage costs and wastage costs (an illustrative sketch of this environment follows the Conclusion). We reimplemented the four heuristic replenishment policies described in that study, with policy parameters fit using stochastic mixed integer linear programming (SMILP); in these policies the ordering decision is based on the total number of units in stock, and the order quantity is either fixed or the difference between the current stock and a target stock level. We used the RL environment to train DRL policies with two popular methods: Deep Q-Networks (DQN) and Proximal Policy Optimization (PPO). We compared the performance of these six policies, together with the optimal policy found using value iteration (VI) and a policy with perfect foresight, on 1,000 randomly generated evaluation episodes, each 365 days long. We repeated the analysis using real demand data obtained from the blood transfusion laboratory at University College London Hospital, a large tertiary care hospital in the United Kingdom, fitting the policies using daily demand data from 2015 and 2016 and evaluating their performance on data from 2017.

Results & Discussion

The DRL policies incurred consistently lower mean daily costs on the simulated evaluation episodes than the four policies fit using SMILP. PPO incurred a lower mean daily cost than DQN in 96% of the evaluation episodes and performed near-optimally: its mean daily cost was only 0.3% higher than that of the VI policy and 8% higher than that of the policy with perfect foresight. The best performing heuristic policy fit using SMILP, (s, S), incurred a mean daily cost 1.2% higher than that of the VI policy. The VI policy could itself be represented as an (s, S) heuristic policy, with different parameters to those found using SMILP. Therefore, in this case, the advantage of DRL over SMILP appears to be that it can efficiently learn from many more example sequences of demand, rather than its ability to represent more complex functions. Of the six SMILP and DRL policies, PPO incurred the lowest mean daily cost on the real demand data from 2017, 10% higher than that of a policy with perfect foresight.
Holding costs were the main difference between PPO and the policy with perfect foresight when using both simulated and real demand data. In both experiments PPO achieved mean wastage of 0%; it suffered a shortage, which would have required placing an additional rush order, on 3.2% of days with simulated demand data and 2.7% of days with real demand data.

Conclusion

DRL can be used to learn near-optimal policies for a simplified platelet replenishment task, consistently outperforming a previously reported approach. This suggests it may be a viable method for finding policies that can be applied in practice to improve the management of platelet inventory and reduce wastage. In future work, the ability of DRL to learn how to act in problems with large observation spaces will enable consideration of additional aspects of the real problem for which existing methods become computationally infeasible or impractical, such as the fact that not all units arrive fresh and not all requested units are transfused.
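To make the replenishment scenario concrete, the following is a minimal plain-Python sketch of the transition and cost logic described in Methods. The shelf-life, cost parameters, demand means, same-day delivery of orders and (s, S) thresholds shown here are illustrative assumptions only, not the values used in the study, which wraps equivalent logic in an OpenAI gym environment.

```python
import numpy as np

# All values below are illustrative assumptions for this sketch; the study's
# parameters follow Rajendran & Srinivas (2020).
SHELF_LIFE = 3                              # assumed shelf-life (days) of delivered units
POISSON_MEANS = [4.4, 5.1, 4.8, 5.0,        # assumed day-of-the-week demand means
                 5.3, 3.2, 2.9]             # (Monday through Sunday)
FIXED_ORDER_COST = 10.0                     # assumed cost of placing any order
VARIABLE_ORDER_COST = 1.0                   # assumed cost per unit ordered
HOLDING_COST = 1.0                          # assumed cost per unit held overnight
SHORTAGE_COST = 20.0                        # assumed cost per unit of unmet demand
WASTAGE_COST = 5.0                          # assumed cost per unit expired


def step(stock, order_qty, weekday, rng):
    """Advance the simulated blood bank by one day.

    `stock[i]` is the number of units that expire at the end of day i from now
    (`stock[0]` expires tonight). Returns the next age profile and the reward,
    which is the negative of the cost incurred.
    """
    stock = stock.copy()
    stock[-1] += order_qty  # ordered units assumed to arrive fresh, same day

    # Demand is sampled from a day-of-the-week specific Poisson distribution
    # and met oldest-unit-first.
    demand = rng.poisson(POISSON_MEANS[weekday])
    remaining = demand
    for age in range(SHELF_LIFE):
        issued = min(stock[age], remaining)
        stock[age] -= issued
        remaining -= issued
    shortage = remaining

    # Unissued units at the end of their shelf-life are wasted; the rest age by a day.
    wastage = stock[0]
    stock = np.append(stock[1:], 0)

    cost = (FIXED_ORDER_COST * (order_qty > 0)
            + VARIABLE_ORDER_COST * order_qty
            + HOLDING_COST * stock.sum()
            + SHORTAGE_COST * shortage
            + WASTAGE_COST * wastage)
    return stock, -cost


def s_S_policy(stock, s=5, S=12):
    """(s, S) heuristic: if total stock falls below s, order up to S units."""
    total = int(stock.sum())
    return S - total if total < s else 0


# Roll out the heuristic for one 365-day evaluation episode.
rng = np.random.default_rng(0)
stock = np.zeros(SHELF_LIFE, dtype=int)
total_reward = 0.0
for day in range(365):
    order = s_S_policy(stock)
    stock, reward = step(stock, order, day % 7, rng)
    total_reward += reward
print(f"Mean daily cost: {-total_reward / 365:.2f}")
```

Under an interface of this form, the DRL agents observe the full age profile `stock`, whereas the SMILP-fit heuristics act only on its sum.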