A nested Markov decision process (NMDP) is a multi-level MDP consisting of an outer MDP and several inner MDPs. The levels are coupled in that each state of the outer MDP induces a unique inner MDP. We propose, for the first time, a heuristic algorithm (NMDP-HA) for solving an infinite-horizon nested Markov decision process under the average reward criterion. The algorithm builds on the policy iteration method, alternating between a value-determination operation and a policy improvement routine. To evaluate the solution quality and computational complexity of NMDP-HA, we develop a specialized enumerative algorithm adapted from a completely observable MDP equivalent of the NMDP problem. The proposed NMDP-HA is illustrated with several numerical examples; for the problem instances evaluated, the heuristic algorithm finds the optimal solution in a fraction of the time the total enumeration algorithm needs to exhaustively search the solution space. In the cases where the optimal solution is not found, the deviation from the optimal value is less than 5%.
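The policy iteration scheme the abstract refers to is, for an ordinary (flat) MDP, Howard's average-reward policy iteration with its value-determination and policy-improvement steps. The sketch below is a minimal illustration of that standard building block only, not the authors' nested NMDP-HA; the function name, data layout, and the unichain assumption are illustrative choices of mine.

```python
import numpy as np

def policy_iteration_avg(P, r, max_iters=100):
    """Average-reward policy iteration for a flat, unichain MDP (illustrative sketch).

    P: transition tensor of shape (S, A, S); r: reward matrix of shape (S, A).
    Returns a stationary policy and its long-run average reward (gain).
    """
    S, A, _ = P.shape
    policy = np.zeros(S, dtype=int)               # start from an arbitrary policy
    g = 0.0
    for _ in range(max_iters):
        # Value-determination: solve g + h(s) = r(s, pi(s)) + sum_s' P(s'|s, pi(s)) h(s'),
        # pinning h(0) = 0 so the linear system has a unique solution.
        P_pi = P[np.arange(S), policy]            # (S, S) transition matrix under pi
        r_pi = r[np.arange(S), policy]            # (S,) reward vector under pi
        M = np.zeros((S, S))
        M[:, 0] = 1.0                             # column of coefficients for the gain g
        M[:, 1:] = np.eye(S)[:, 1:] - P_pi[:, 1:]
        sol = np.linalg.solve(M, r_pi)
        g, h = sol[0], np.concatenate(([0.0], sol[1:]))
        # Policy improvement: act greedily with respect to the bias values h.
        q = r + P @ h                             # (S, A) test quantities
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):    # no change -> policy is gain-optimal
            break
        policy = new_policy
    return policy, g

if __name__ == "__main__":
    # Tiny randomly generated MDP, purely to show the call signature.
    rng = np.random.default_rng(0)
    P = rng.random((4, 2, 4)); P /= P.sum(axis=2, keepdims=True)
    r = rng.random((4, 2))
    pi_star, gain = policy_iteration_avg(P, r)
    print(pi_star, gain)
```

Fixing h(0) = 0 is the standard normalization that makes the value-determination linear system uniquely solvable for unichain policies. The abstract does not detail how the outer and inner levels are coupled algorithmically, so this flat routine should be read only as the per-level ingredient, not as the nested procedure itself.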