In this paper, we consider a classic periodic-review lost-sales inventory system with lead times, which has a wide range of real-world applications but is notoriously challenging to optimize. We consider a joint learning and optimization problem in which the decision-maker does not know the demand distribution a priori and can only use past sales information (i.e., censored demand). Departing from existing learning algorithms for this problem (e.g., Huh et al. 2009a, Agrawal and Jia 2019, Zhang et al. 2020) that require the convexity property of the underlying system, we develop an Upper Confidence Bound (UCB)-type learning framework and show that it can be applied to learning not only the optimal base-stock policy but also the optimal capped base-stock policy, for which the convexity property no longer holds. Compared with a classic multi-armed bandit problem, our problem poses unique challenges due to the nature of the inventory system: (1) each action has long-term impacts on future costs, and (2) the system state space is exponentially large in the lead time. Hence, our learning algorithms are not naive adaptations of the classic UCB algorithm: the design of the simulation and averaging steps is novel, and the confidence width in the UCB index also differs from the classic one. We prove that the regret bounds of our learning algorithms are tight, up to a logarithmic term, in the planning horizon T. Our extensive numerical experiments suggest that the proposed algorithms (almost) dominate existing learning algorithms. We also propose a practical way to select which learning algorithm to use with limited demand data.
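To make the setting concrete, the following is a minimal sketch of the system dynamics only, not of the learning algorithms proposed in the paper. It simulates a periodic-review lost-sales system with a lead time under a base-stock policy, or a capped base-stock policy when an order cap is given, and records only the sales, so demand is censored whenever stock runs out. The function name `simulate`, the cost parameters `h` (holding) and `p` (lost-sales penalty), and the within-period ordering of events are illustrative assumptions rather than details taken from the paper.

```python
from collections import deque


def simulate(demands, lead_time, base_stock, cap=None, h=1.0, p=4.0):
    """Simulate a lost-sales system under a (capped) base-stock policy.

    Assumed sequence of events in each period (a common convention, not
    necessarily the paper's): place an order, receive the order placed
    `lead_time` periods ago, then observe sales and incur costs.
    Returns (average cost per period, list of observed sales).
    """
    on_hand = 0
    pipeline = deque([0] * lead_time)  # orders placed but not yet delivered
    total_cost = 0.0
    sales_log = []

    for demand in demands:
        # Inventory position = on-hand stock + all outstanding orders.
        position = on_hand + sum(pipeline)
        order = max(base_stock - position, 0)
        if cap is not None:
            # Capped base-stock: never order more than `cap` units at once.
            order = min(order, cap)

        # The new order joins the pipeline; the oldest one is delivered.
        pipeline.append(order)
        on_hand += pipeline.popleft()

        # Only sales are observed; unmet demand is lost and unrecorded.
        sales = min(demand, on_hand)
        lost = demand - sales
        on_hand -= sales

        total_cost += h * on_hand + p * lost
        sales_log.append(sales)

    return total_cost / len(demands), sales_log
```

A learner in this setting would see only the returned sales list, never the `demands` passed in. Because the order placed in one period affects stock `lead_time` periods later, the cost of a policy cannot be judged from a single period, which illustrates the first challenge noted in the abstract.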