Abstract
A nested Markov decision process (NMDP) is a multi-level MDP consisting of an outer MDP and several inner MDPs. These MDPs depend on each other in that each state of the outer MDP induces a unique inner MDP. We propose for the first time an algorithm, NMDP-HA, to solve an infinite-horizon nested Markov decision process under the average reward criterion. The algorithm incorporates the policy iteration method, which consists of a value-determination operation and a policy improvement routine. To evaluate the solution quality and computational complexity of the NMDP-HA, we develop a specialized enumerative algorithm adapted from a completely observable MDP equivalent of the NMDP problem. The proposed NMDP-HA is illustrated with several numerical examples, and our results for the problem instances evaluated indicate that the heuristic algorithm finds the optimal solution in a fraction of the time the total enumeration algorithm needs to exhaustively search the entire solution space. For the cases where the optimal solution is not found, the percentage deviation from the optimum is less than 5%.
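The policy iteration building block named in the abstract follows the classical value-determination / policy-improvement cycle. As a point of reference only, a minimal sketch of that cycle for an ordinary (non-nested) unichain MDP under the average-reward criterion might look as follows; this is not the authors' NMDP-HA, and the transition and reward data below are invented for illustration.

```python
import numpy as np

def value_determination(P, r, policy):
    """Solve h(s) + g = r(s, pi(s)) + sum_s' P(s'|s, pi(s)) h(s'),
    fixing h(0) = 0 so the linear system has a unique solution."""
    n = P.shape[-1]
    A = np.zeros((n + 1, n + 1))   # unknowns: bias h(0..n-1) and gain g
    b = np.zeros(n + 1)
    for s in range(n):
        a = policy[s]
        A[s, :n] = -P[a, s]        # -sum_s' P(s'|s,a) h(s')
        A[s, s] += 1.0             # +h(s)
        A[s, n] = 1.0              # +g
        b[s] = r[a, s]
    A[n, 0] = 1.0                  # normalization: h(0) = 0
    x = np.linalg.solve(A, b)
    return x[:n], x[n]             # bias vector h, average reward (gain) g

def policy_improvement(P, r, h):
    """Greedy improvement: argmax_a r(s,a) + sum_s' P(s'|s,a) h(s')."""
    q = r + P @ h                  # q[a, s]
    return np.argmax(q, axis=0)

def policy_iteration(P, r, max_iter=100):
    """Alternate value determination and policy improvement until stable."""
    n_states = P.shape[-1]
    policy = np.zeros(n_states, dtype=int)
    for _ in range(max_iter):
        h, g = value_determination(P, r, policy)
        new_policy = policy_improvement(P, r, h)
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    return policy, g

# Hypothetical two-state, two-action example:
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # transitions under action 0
              [[0.5, 0.5], [0.6, 0.4]]])   # transitions under action 1
r = np.array([[1.0, 2.0],                   # rewards r(a=0, s)
              [1.5, 0.5]])                  # rewards r(a=1, s)
policy, gain = policy_iteration(P, r)
print("optimal policy:", policy, "average reward (gain):", gain)
```

In the nested setting studied in the paper, each outer state induces its own inner MDP, so the reward term fed into the outer value-determination step would itself come from solving the corresponding inner problem; the sketch above shows only the single-level cycle.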