Abstract

We present a novel online decision-making solution, where the optimal path of a given decision tree is dynamically found based on the contextual bandits analysis. At each round, the learner finds a path in the decision tree by making a sequence of decisions following the tree structure and receives an outcome when a terminal node is reached. At each decision node, the environment information is observed to hint on which child node to visit, resulting in a better outcome. The objective is to learn the context-specific optimal decision for each decision node to maximize the accumulated outcome. In this paper, we propose Dynamic Path Identifier (DPI), a learning algorithm where the contextual bandit is applied to every decision node, and the observed outcome is used as the reward of the previous decisions of the same round. The technical difficulty of DPI is the high exploration challenge caused by the width (i.e., the number of paths) of the tree as well as the large context space. We mathematically prove that DPI’s regret per round approached zero as the number of the rounds approaches infinity. We also prove that the regret is not a function of the number of paths in the tree. Numerical evaluations are provided to complement the theoretical analysis.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call