We consider a stochastic lost-sales inventory control system with lead time L over a planning horizon T. Supply is uncertain and is a function of the order quantity (because of random yield, random capacity, etc.). We aim to minimize the T-period cost, a problem that is known to be computationally intractable even when the distributions of demand and supply are known. In this paper, we assume that both the demand and supply distributions are unknown and develop a computationally efficient online learning algorithm. We show that our algorithm achieves a regret (i.e., the performance gap between the cost of our algorithm and that of an optimal policy over T periods) of [Formula: see text] when [Formula: see text]. We do so by (1) showing that our algorithm’s cost is higher by at most [Formula: see text] for any [Formula: see text] compared with an optimal constant-order policy under complete information (a widely used algorithm) and (2) leveraging the latter’s known performance guarantee from the existing literature. To the best of our knowledge, a finite-sample [Formula: see text] (and polynomial in L) regret bound benchmarked against an optimal policy was not previously known in the online inventory control literature. A key challenge in this learning problem is that both demand and supply data can be censored; hence, only truncated values are observable. We circumvent this challenge by showing that the data generated under an order quantity q2 allow us to simulate the performance of not only q2 but also of q1 for all [Formula: see text], a key observation that yields sufficient information even under data censoring. By establishing a high-probability coupling argument, we are able to evaluate and compare the performance of different order policies at their steady state within a finite time horizon.
Because the problem lacks convexity, commonly used learning algorithms, such as stochastic gradient descent and bisection, cannot be applied; instead, we develop an active elimination method that adaptively rules out suboptimal solutions.

This paper was accepted by Victor Martínez-de-Albéniz, operations management. Funding: This work is supported by the National Science Foundation [Grant CCF-2312205]. Z. Zhou also acknowledges the New York University’s 2024 Center for Global Economy and Business [Faculty Research Grant] and New York University [Research Catalyst Prize]. Supplemental Material: The online appendix is available at https://doi.org/10.1287/mnsc.2022.02476 .
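
To make the idea of eliminating suboptimal constant-order quantities concrete, the following is a minimal illustrative sketch, not the algorithm from the paper. It assumes fully observed (uncensored) simulated data, a uniform random-yield supply model, exponential demand with mean 1, and a heuristic confidence radius; the names `simulate_constant_order`, `eliminate`, `width`, and the cost parameters `h` and `p` are all hypothetical choices made for this example.

```python
import math
import random


def simulate_constant_order(q, lead_time, periods, rng, h=1.0, p=4.0):
    """Average per-period cost of ordering q in every period (lost sales).

    Assumptions (not from the paper): received supply is a uniform random
    fraction of q, demand is exponential with mean 1, h is the unit holding
    cost, and p is the unit lost-sales penalty.
    """
    pipeline = [0.0] * lead_time      # orders placed but not yet delivered
    on_hand = 0.0
    total_cost = 0.0
    for _ in range(periods):
        # Random yield: only a fraction of the order is actually supplied.
        pipeline.append(q * rng.uniform(0.5, 1.0))
        on_hand += pipeline.pop(0)    # order placed lead_time periods ago arrives
        demand = rng.expovariate(1.0)
        sales = min(on_hand, demand)
        total_cost += p * (demand - sales)   # unmet demand is lost
        on_hand -= sales
        total_cost += h * on_hand            # holding cost on leftover stock
    return total_cost / periods


def eliminate(candidates, lead_time, rounds, periods_per_round, width=2.0, seed=0):
    """Generic successive elimination over a grid of constant-order quantities.

    Each round re-simulates every surviving candidate, updates its running
    mean cost, and drops candidates whose optimistic cost estimate is still
    worse than the pessimistic estimate of the current best. The radius
    width / sqrt(n) is a heuristic assumption, not a bound from the paper.
    """
    rng = random.Random(seed)
    active = list(candidates)
    means = {q: 0.0 for q in active}
    for n in range(1, rounds + 1):
        for q in active:
            cost = simulate_constant_order(q, lead_time, periods_per_round, rng)
            means[q] += (cost - means[q]) / n    # running average over n rounds
        radius = width / math.sqrt(n)
        best = min(means[q] for q in active)
        active = [q for q in active if means[q] - radius <= best + radius]
    return active, means


if __name__ == "__main__":
    survivors, estimates = eliminate(
        [0.0, 0.4, 0.8, 1.2], lead_time=2, rounds=30, periods_per_round=500
    )
    print("surviving order quantities:", survivors)
```

Elimination is used here instead of a gradient or bisection search because it only compares estimated costs across candidates and so does not rely on convexity. The sketch sidesteps the censoring and steady-state coupling issues that the abstract highlights; handling those is the substance of the paper.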