Abstract

Recent research shows that the policy iteration algorithm for Markov decision processes (MDPs) is a natural consequence of the performance difference formula, which compares the performance of two different policies. In this paper, we extend this idea to bias-optimal policies of MDPs. We first derive a formula that compares the biases of any two policies having the same gain, and we then show that a policy iteration algorithm leading to a bias-optimal policy follows naturally from this bias difference formula. Our results extend those in (Lewis & Puterman, 2001) to the multichain case and provide a simple and intuitive explanation for the mathematics in (Veinott, 1966; Veinott, 1969). The results also support the idea that solutions to performance-optimization problems, including bias optimality, can be obtained from performance sensitivity formulas.
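For illustration only, the following sketch records a standard form of the average-reward performance difference formula in the unichain (ergodic) case, and how the policy improvement step follows from it; the notation and the multichain treatment in the paper may differ.

```latex
% Illustrative sketch (assumed standard unichain setting, not the
% paper's exact notation): policy $d$ has transition matrix $P$,
% reward vector $f$, gain $\eta$, and bias (potential) $g$; primed
% quantities belong to a second policy $d'$ with stationary
% distribution $\pi'$.
%
% The bias of policy $d$ satisfies the Poisson equation
\[
  g + \eta e = f + P g, \qquad e = (1,\dots,1)^{\top}.
\]
% The performance (gain) difference formula then reads
\[
  \eta' - \eta \;=\; \pi' \bigl( f' + P' g - f - P g \bigr)
               \;=\; \pi' \bigl( f' + P' g - g - \eta e \bigr).
\]
% Because $\pi' \ge 0$ componentwise, choosing at each state an action
% that increases $f' + P' g$ componentwise cannot decrease the gain;
% iterating this choice is exactly the policy improvement step of
% policy iteration. The paper derives an analogous difference formula
% for the biases of two policies with equal gains, from which a
% bias-optimal policy iteration algorithm follows in the same way.
```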
