Abstract

A fundamental yet notoriously difficult problem in operations management is the periodic inventory control problem under positive lead time and lost sales. More recently, there has been interest in the problem setting where the demand distribution is not known a priori and must be learned from the observations made during the decision-making process. In “Learning in Structured MDPs with Convex Cost Functions: Improved Regret Bounds for Inventory Management,” Agrawal and Jia present a reinforcement learning algorithm that uses the observed outcomes of past decisions to implicitly learn the underlying dynamics and adaptively improve the decision-making strategy over time. They show that, compared with the best base-stock policy, their algorithm achieves an optimal regret bound in terms of the time horizon and scales linearly with the lead time of the inventory ordering process. Furthermore, they demonstrate that their approach is not restricted to the inventory problem and can be applied in an almost black box manner to more general reinforcement learning problems with convex cost functions.
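
To make the problem setting concrete, the following is a minimal simulation sketch of the periodic-review, lost-sales inventory system with positive lead time and the base-stock benchmark policy referenced above. It follows the standard textbook dynamics; the cost parameters, the Poisson demand, and the function name simulate_base_stock are illustrative assumptions and are not taken from the paper, and the sketch is not the authors' learning algorithm.

```python
import numpy as np

def simulate_base_stock(base_stock_level, lead_time, horizon, demand_sampler,
                        holding_cost=1.0, lost_sales_penalty=4.0, seed=0):
    """Average per-period cost of a fixed base-stock policy in a
    periodic-review, lost-sales inventory system with positive lead time.
    (Illustrative sketch; parameters are assumptions, not from the paper.)"""
    rng = np.random.default_rng(seed)
    on_hand = 0.0                      # inventory physically on the shelf
    pipeline = [0.0] * lead_time       # orders placed but not yet delivered
    total_cost = 0.0

    for _ in range(horizon):
        # Orders placed lead_time periods ago arrive now.
        if lead_time > 0:
            on_hand += pipeline.pop(0)

        # Base-stock policy: order up to the target inventory position
        # (on-hand stock plus everything still in the pipeline).
        inventory_position = on_hand + sum(pipeline)
        order = max(base_stock_level - inventory_position, 0.0)
        if lead_time > 0:
            pipeline.append(order)
        else:
            on_hand += order           # zero lead time: order arrives immediately

        # Demand realizes; unmet demand is lost, not backlogged.
        demand = demand_sampler(rng)
        sales = min(on_hand, demand)
        lost_sales = demand - sales
        on_hand -= sales

        total_cost += holding_cost * on_hand + lost_sales_penalty * lost_sales

    return total_cost / horizon


# Illustrative use: evaluate a few candidate base-stock levels under Poisson(5) demand.
if __name__ == "__main__":
    demand = lambda rng: rng.poisson(5)
    for level in (5, 8, 10, 12):
        avg_cost = simulate_base_stock(level, lead_time=2, horizon=50_000,
                                       demand_sampler=demand)
        print(f"base-stock level {level}: average cost ~ {avg_cost:.2f}")
```

In a simulator of this kind the learning problem studied in the paper is to choose ordering decisions adaptively over the horizon without knowing the demand distribution, so that cumulative cost stays close to that of the best fixed base-stock level in hindsight; the regret of Agrawal and Jia's algorithm grows optimally in the time horizon and linearly in the lead time.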
