Abstract
We study the policy iteration algorithm (PIA) for continuous-time jump Markov decision processes in general state and action spaces. The corresponding transition rates are allowed to be unbounded, and the reward rates may have neither upper nor lower bounds. The criterion that we are concerned with is the expected average reward. We propose a set of conditions under which we first establish the average reward optimality equation and present the PIA. Then, under two slightly different sets of conditions, we show that the PIA yields the optimal (maximum) reward, an average optimal stationary policy, and a solution to the average reward optimality equation.
Highlights
In this paper we study the average reward optimality problem for continuous-time jump Markov decision processes (MDPs) in general state and action spaces
In the previous sections we have studied the policy iteration algorithm (PIA) for average reward continuous-time jump MDPs in Polish spaces
Under two slightly different sets of conditions we have shown that the PIA yields the optimal (maximum) reward, an average optimal stationary policy, and a solution to the average reward optimality equation
Summary
In this paper we study the average reward optimality problem for continuous-time jump Markov decision processes (MDPs) in general state and action spaces. Our approach to this problem is the well-known policy iteration algorithm (PIA), also known as Howard's policy improvement algorithm. The PIA was originally introduced by Howard (1960) in [1] for finite MDPs, i.e., MDPs whose state and action spaces are both finite. Using the monotonicity of the sequence of iterated average rewards, he showed that the PIA converges in a finite number of steps. When the state space is not finite, there are well-known counterexamples showing that the PIA may fail to converge even when the action space is compact (see, e.g., [2–4]).
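To make Howard's scheme concrete, the following is a minimal illustrative sketch of average-reward policy iteration for the classical discrete-time, finite-state, unichain case that Howard treated, not the continuous-time jump MDPs with unbounded rates studied in this paper. The function name `policy_iteration_avg` and the two-state example MDP are our own constructions for illustration. Each iteration solves the evaluation equations g + h(s) = r(s, π(s)) + Σ_y P(y | s, π(s)) h(y), with the bias normalized by h(0) = 0, and then improves the policy greedily.

```python
import numpy as np

def policy_iteration_avg(P, r, max_iter=100):
    """Howard's policy iteration for a finite unichain average-reward MDP.

    P: (S, A, S) array of transition probabilities P[s, a, y].
    r: (S, A) array of one-step rewards r[s, a].
    Returns (gain g, bias vector h with h[0] = 0, stationary policy pi).
    """
    S, A, _ = P.shape
    pi = np.zeros(S, dtype=int)  # start from an arbitrary stationary policy
    for _ in range(max_iter):
        # --- policy evaluation: solve g*1 + h = r_pi + P_pi h, h[0] = 0 ---
        P_pi = P[np.arange(S), pi]            # (S, S) transition matrix of pi
        r_pi = r[np.arange(S), pi]            # (S,)  reward vector of pi
        # Unknowns: [g, h_1, ..., h_{S-1}] (h_0 is pinned to 0).
        M = np.zeros((S, S))
        M[:, 0] = 1.0                         # coefficient of the gain g
        M[:, 1:] = (np.eye(S) - P_pi)[:, 1:]  # coefficients of h_1..h_{S-1}
        sol = np.linalg.solve(M, r_pi)
        g, h = sol[0], np.concatenate(([0.0], sol[1:]))
        # --- policy improvement: greedy w.r.t. r(s,a) + sum_y P(y|s,a) h(y) ---
        Q = r + P @ h                         # (S, A)
        new_pi = Q.argmax(axis=1)
        if np.array_equal(new_pi, pi):        # no improvement: pi is optimal
            return g, h, pi
        pi = new_pi
    return g, h, pi

# Two-state, two-action example (hypothetical): action 1 in state 0 jumps to
# state 1, where action 0 earns reward 2 and mostly stays put.
P = np.array([[[0.9, 0.1], [0.0, 1.0]],
              [[0.1, 0.9], [1.0, 0.0]]])
r = np.array([[1.0, 0.0],
              [2.0, 0.0]])
g, h, pi = policy_iteration_avg(P, r)
```

On this example the algorithm stops after two improvement steps at the policy (1, 0) with gain g = 20/11, matching the monotone-improvement behaviour Howard established: the iterated average rewards increase until the policy repeats.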