Abstract
We study the policy iteration algorithm (PIA) for continuous-time jump Markov decision processes in general state and action spaces. The corresponding transition rates are allowed to be unbounded, and the reward rates may have neither upper nor lower bounds. The criterion that we are concerned with is the expected average reward. We propose a set of conditions under which we first establish the average reward optimality equation and present the PIA. Then, under two slightly different sets of conditions, we show that the PIA yields the optimal (maximum) reward, an average optimal stationary policy, and a solution to the average reward optimality equation.
Highlights
In this paper we study the average reward optimality problem for continuous-time jump Markov decision processes (MDPs) in general state and action spaces
In the previous sections we have studied the policy iteration algorithm (PIA) for average reward continuous-time jump MDPs in Polish spaces
Under two slightly different sets of conditions we have shown that the PIA yields the optimal (maximum) reward, an average optimal stationary policy, and a solution to the average reward optimality equation
Summary
In this paper we study the average reward optimality problem for continuous-time jump Markov decision processes (MDPs) in general state and action spaces. Our approach to this problem is the well-known policy iteration algorithm (PIA), also known as Howard's policy improvement algorithm. The PIA was originally introduced by Howard (1960) in [1] for finite MDPs, i.e., MDPs whose state and action spaces are both finite. Using the monotonicity of the sequence of iterated average rewards, he showed that the PIA converges in a finite number of steps. When the state space is not finite, there are well-known counterexamples showing that the PIA may fail to converge even when the action space is compact (see, e.g., [2–4]).
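To make Howard's scheme concrete, the following is a minimal illustrative sketch of average-reward policy iteration for the classical discrete-time, finite-state, unichain case that Howard treated, not the continuous-time jump MDPs with unbounded rates studied in this paper. The function name `policy_iteration_avg` and the two-state example MDP are our own constructions for illustration. Each iteration solves the evaluation equations g + h(s) = r(s, π(s)) + Σ_y P(y | s, π(s)) h(y), with the bias normalized by h(0) = 0, and then improves the policy greedily.

```python
import numpy as np

def policy_iteration_avg(P, r, max_iter=100):
    """Howard's policy iteration for a finite unichain average-reward MDP.

    P: (S, A, S) array of transition probabilities P[s, a, y].
    r: (S, A) array of one-step rewards r[s, a].
    Returns (gain g, bias vector h with h[0] = 0, stationary policy pi).
    """
    S, A, _ = P.shape
    pi = np.zeros(S, dtype=int)  # start from an arbitrary stationary policy
    for _ in range(max_iter):
        # --- policy evaluation: solve g*1 + h = r_pi + P_pi h, h[0] = 0 ---
        P_pi = P[np.arange(S), pi]            # (S, S) transition matrix of pi
        r_pi = r[np.arange(S), pi]            # (S,)  reward vector of pi
        # Unknowns: [g, h_1, ..., h_{S-1}] (h_0 is pinned to 0).
        M = np.zeros((S, S))
        M[:, 0] = 1.0                         # coefficient of the gain g
        M[:, 1:] = (np.eye(S) - P_pi)[:, 1:]  # coefficients of h_1..h_{S-1}
        sol = np.linalg.solve(M, r_pi)
        g, h = sol[0], np.concatenate(([0.0], sol[1:]))
        # --- policy improvement: greedy w.r.t. r(s,a) + sum_y P(y|s,a) h(y) ---
        Q = r + P @ h                         # (S, A)
        new_pi = Q.argmax(axis=1)
        if np.array_equal(new_pi, pi):        # no improvement: pi is optimal
            return g, h, pi
        pi = new_pi
    return g, h, pi

# Two-state, two-action example (hypothetical): action 1 in state 0 jumps to
# state 1, where action 0 earns reward 2 and mostly stays put.
P = np.array([[[0.9, 0.1], [0.0, 1.0]],
              [[0.1, 0.9], [1.0, 0.0]]])
r = np.array([[1.0, 0.0],
              [2.0, 0.0]])
g, h, pi = policy_iteration_avg(P, r)
```

On this example the algorithm stops after two improvement steps at the policy (1, 0) with gain g = 20/11, matching the monotone-improvement behaviour Howard established: the iterated average rewards increase until the policy repeats.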