Abstract
Partially observed Markov decision processes (POMDPs) are an important class of control problems that are ubiquitous in a wide range of fields. Unfortunately, these problems are generally intractable, so in practice we must settle for suboptimal policies. But how do we evaluate the quality of these policies? This question has been addressed in recent years in the Markov decision process (MDP) literature through the use of information-relaxation-based duality, where the nonanticipativity constraints are relaxed but a penalty is imposed for violations of these constraints. In this paper, we extend the information relaxation approach to POMDPs. It is of course well known that the belief-state formulation of a POMDP is an MDP, and therefore the previously developed results for MDPs also apply to POMDPs. Under the belief-state formulation, we use recently developed change-of-measure arguments to solve the so-called inner problems, and we use standard filtering arguments to identify the appropriate Radon-Nikodym derivatives. We show, however, that dual bounds can also be constructed without resorting to the belief-state formulation. In this case, change-of-measure arguments are required for the evaluation of the so-called dual feasible penalties rather than for the solution of the inner problems. We compare dual bounds for both formulations and argue that, in general, the belief-state formulation provides tighter bounds. The second main contribution of this paper is to show that several value function approximations for POMDPs are in fact supersolutions. This is of interest because it can be particularly advantageous to construct penalties from supersolutions, since absolute continuity (of the change of measure) is no longer required, and therefore significant variance reduction can be achieved when estimating the duality gap directly. Dual bounds constructed from supersolution-based penalties are also guaranteed to be tighter than the bounds provided by the supersolutions themselves. We use applications from robotic navigation and telecommunications to demonstrate our results.
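The belief-state formulation mentioned above rests on the standard Bayes filter: after taking an action and receiving an observation, the belief over hidden states is updated via the transition and observation models, and the resulting belief process is itself a (fully observed) MDP. A minimal sketch of this update, using a hypothetical two-state, single-action POMDP (the matrices below are illustrative, not from the paper):

```python
import numpy as np

# Hypothetical 2-state POMDP with a single action.
# T[s, s']  = P(s' | s)       (transition model)
# O[s', o]  = P(o | s')       (observation model)
T = np.array([[0.9, 0.1],
              [0.2, 0.8]])
O = np.array([[0.7, 0.3],
              [0.4, 0.6]])

def belief_update(b, obs, T, O):
    """Bayes filter: b'(s') proportional to O[s', obs] * sum_s T[s, s'] * b(s)."""
    unnormalized = O[:, obs] * (b @ T)   # predict, then correct with the observation
    return unnormalized / unnormalized.sum()

b0 = np.array([0.5, 0.5])        # uniform prior belief
b1 = belief_update(b0, obs=0, T=T, O=O)
```

The normalizing constant in the last step is exactly the likelihood of the observation given the current belief; ratios of such likelihoods under different measures are what the paper's Radon-Nikodym derivatives capture.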