Abstract

We consider finite-stage Markov decision processes (MDPs) under incomplete information, where the decision-maker knows only that the true transition probability and reward matrices belong to given finite sets. The decision-maker interacts with the system over a finite number of episodes. The first episode begins with a probabilistic belief about the true probability and reward matrices, and this belief is updated at the end of each episode using the observed events. The goal is to maximize the expected total reward earned over all episodes. In the resulting model-based episodic Bayesian MDP, it suffices to consider only the (known) policies that are optimal for each of the candidate probability and reward matrices. Nevertheless, the decision-maker must balance executing policies that reveal information about the true probabilities and rewards (exploration) against exploiting current knowledge to increase rewards (exploitation). We propose a framework called information-directed policy sampling (IDPS). In each episode, the decision-maker balances this exploration-exploitation trade-off by executing a randomized policy that minimizes a so-called convex information ratio. We derive a regret bound that is independent of the state- and action-space cardinalities when the set of matrices is exogenously determined. Numerical experiments show that IDPS outperforms posterior sampling, a state-of-the-art approach.
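To make the episode-selection step concrete, the following is a minimal sketch of how a randomized policy minimizing an information ratio could be computed over the finite set of candidate optimal policies. It assumes the information ratio takes the squared-regret-over-information-gain form used in information-directed sampling; the paper's exact convex ratio, the posterior update, and the information-gain measure are not specified here, so the quantities `regret` and `info_gain` and the helper name are hypothetical placeholders.

```python
import numpy as np

def select_policy_distribution(regret, info_gain, grid=101):
    """Hedged sketch: find a distribution over the candidate policies that
    minimizes Delta(p)^2 / g(p), where
      regret[k]    = posterior-expected regret of executing candidate policy k,
      info_gain[k] = posterior-expected information gain of executing policy k.
    Uses the known property of this ratio that a minimizer is supported on at
    most two policies, so we scan policy pairs with a grid on the mixing weight.
    """
    K = len(regret)
    weights = np.linspace(0.0, 1.0, grid)
    best_ratio, best_p = np.inf, None
    for i in range(K):
        for j in range(K):
            for w in weights:
                d = w * regret[i] + (1.0 - w) * regret[j]
                g = w * info_gain[i] + (1.0 - w) * info_gain[j]
                if g <= 1e-12:
                    continue  # no information gained; ratio undefined
                ratio = d * d / g
                if ratio < best_ratio:
                    p = np.zeros(K)
                    p[i] += w
                    p[j] += 1.0 - w
                    best_ratio, best_p = ratio, p
    return best_p, best_ratio

# Illustrative (made-up) numbers for three candidate models, hence three
# candidate optimal policies; the decision-maker would then sample a policy
# from p and execute it for the episode.
regret = np.array([0.05, 0.30, 0.50])
info_gain = np.array([0.01, 0.20, 0.40])
p, ratio = select_policy_distribution(regret, info_gain)
```

In practice the expected regrets and information gains would be recomputed from the updated belief at the start of every episode; the grid search over two-policy mixtures is just one simple way to solve the resulting one-dimensional convex problems.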
