The One-Warehouse Multi-Retailer (OWMR) system is the prototypical distribution and inventory system. Many OWMR variants exist; e.g., demand in excess of supply may be completely back-ordered, partially back-ordered, or lost. Prior research has focused on heuristic reordering policies, such as echelon base-stock levels, coupled with heuristic allocation policies. Constructing well-performing policies is time-consuming and must be redone for every problem variant. By contrast, Deep Reinforcement Learning (DRL) is a general-purpose technique for sequential decision making that has yielded good results for various challenging inventory systems. However, applying DRL to OWMR problems is nontrivial, since allocation involves setting a quantity for each retailer: the number of possible allocations grows exponentially in the number of retailers. Since each action is typically associated with a neural network output node, this renders standard DRL techniques intractable. Our proposed DRL algorithm instead parameterizes a multi-discrete action distribution, whose number of output nodes grows only linearly in the number of retailers. Moreover, when total retailer orders exceed the available warehouse inventory, we propose a random rationing policy that substantially improves the ability of standard DRL algorithms to train good policies, because it promotes the learning of feasible retailer order quantities. The resulting algorithm outperforms general-purpose benchmark policies by ∼1–3% for the lost-sales case and by ∼12–20% for the partial back-ordering case. For complete back-ordering, the algorithm cannot consistently outperform the benchmark.
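The two key ideas in the abstract, a multi-discrete action head and random rationing, can be illustrated with a small sketch. This is a hypothetical NumPy illustration, not the paper's implementation: the function names, the per-unit random granting rule, and the independent-categorical sampling are assumptions made for exposition. It shows why a multi-discrete head needs only `n_retailers * (max_order + 1)` output nodes rather than `(max_order + 1) ** n_retailers`, and one plausible way to ration scarce warehouse stock at random.

```python
import numpy as np

def sample_multi_discrete(logits, rng):
    """Sample one order quantity per retailer from independent categorical
    distributions. `logits` has shape (n_retailers, n_quantities), so the
    number of output nodes grows linearly in the number of retailers,
    versus n_quantities ** n_retailers for a flat joint action space.
    """
    # Row-wise softmax, numerically stabilized.
    z = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    # Draw one quantity index per retailer.
    return np.array([rng.choice(len(p), p=p) for p in probs])

def random_rationing(orders, available, rng):
    """Hypothetical random rationing rule: when total retailer orders
    exceed warehouse stock, grant units one at a time in uniformly
    random order until stock runs out. The paper's exact rule may differ.
    """
    allocation = np.zeros_like(orders)
    # One pool entry per requested unit, tagged with its retailer index.
    pool = np.repeat(np.arange(len(orders)), orders)
    rng.shuffle(pool)
    for retailer in pool[:available]:
        allocation[retailer] += 1
    return allocation

rng = np.random.default_rng(0)
n_retailers, max_order = 5, 10
logits = rng.normal(size=(n_retailers, max_order + 1))
orders = sample_multi_discrete(logits, rng)   # one quantity per retailer
allocation = random_rationing(orders, available=8, rng=rng)
```

Note that when warehouse stock covers all orders, `random_rationing` simply grants every order in full; randomness only matters under scarcity, which is exactly the regime in which the abstract reports that random rationing helps DRL learn feasible order quantities.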