Abstract

In this paper, we study the variance optimization problem in Markov decision processes (MDPs). The objective is to find the policy that minimizes the long-run average variance of the system rewards. Because the variance function is quadratic and the rewards at different stages are mutually correlated, the associated variance minimization problem is not a linear program. The traditional approaches of classical MDP theory, which are well suited to linear problems, are not directly applicable. We define a fundamental quantity called the variance potential and derive a variance difference equation that quantifies the difference between the reward variances of a Markov system under any two policies. Based on this equation, we propose an iterative algorithm, analogous to policy iteration in classical MDP theory, that reduces the reward variance of Markov systems. Although the algorithm converges only to a local optimum, it is much more efficient than traditional gradient-based algorithms. Numerical experiments demonstrate the main idea of the paper.
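
To illustrate the flavor of such a policy-iteration-style variance reduction loop, the sketch below is a minimal, hypothetical Python implementation for a finite ergodic MDP. The quantities used here (stationary distribution, a squared-deviation "pseudo-reward", a potential obtained from a Poisson equation, and a greedy improvement step that holds the current mean fixed) are standard constructions assumed for illustration; they are not taken from the paper, and the exact form of the paper's variance potential and variance difference equation may differ.

```python
# Hypothetical sketch of a policy-iteration-style variance reduction loop.
# Assumptions (not from the paper): rewards depend only on (state, action),
# the chain is ergodic under every policy, and the improvement step treats
# the current average reward eta as fixed, which is one reason the procedure
# may stop at a local optimum.
import numpy as np

def stationary_dist(P):
    """Stationary distribution pi of an ergodic transition matrix P (pi P = pi)."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1); b[-1] = 1.0
    return np.linalg.lstsq(A, b, rcond=None)[0]

def variance_policy_iteration(P, r, max_iter=100):
    """P[a] is the transition matrix and r[a] the reward vector under action a."""
    n_actions, n_states = r.shape
    policy = np.zeros(n_states, dtype=int)            # arbitrary initial policy
    for _ in range(max_iter):
        P_pi = P[policy, np.arange(n_states)]          # row s is P[policy[s], s, :]
        r_pi = r[policy, np.arange(n_states)]
        pi = stationary_dist(P_pi)
        eta = pi @ r_pi                                # long-run average reward
        f = (r_pi - eta) ** 2                          # squared-deviation pseudo-reward
        # "Variance potential" g from the Poisson equation (I - P_pi + 1 pi^T) g = f
        g = np.linalg.solve(np.eye(n_states) - P_pi + np.outer(np.ones(n_states), pi), f)
        # Greedy improvement: pick the action with the smallest variance-like Q-value,
        # keeping the current eta fixed.
        Q = (r - eta) ** 2 + np.einsum('asj,j->as', P, g)
        new_policy = Q.argmin(axis=0)
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    return policy

# Toy usage: a 2-state, 2-action MDP with arbitrary numbers.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.6, 0.4]]])
r = np.array([[1.0, 4.0],
              [2.0, 2.5]])
print(variance_policy_iteration(P, r))
```

As in classical policy iteration, each iteration evaluates the current policy (here, its mean reward and a potential for the squared deviations) and then improves state by state, so the number of iterations is typically small compared with gradient-based search over the policy space.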
