The relative value iteration (RVI) scheme for Markov decision processes (MDPs) dates back to the seminal work of White (1963), which introduced an algorithm for solving the ergodic dynamic programming equation in the finite-state, finite-action case. Its ramifications have given rise to popular learning algorithms such as Q-learning. More recently, the algorithm has gained prominence because of its implications for model predictive control (MPC). For stochastic control problems on an infinite time horizon, especially problems that seek to optimize the average performance (ergodic control), obtaining the optimal policy in explicit form is possible only for a few classes of well-structured models. What is often used in practice is a heuristic method called the rolling horizon, or receding horizon, or MPC. It works as follows: one solves the finite-horizon problem for a given number of steps N, or over an interval [0,T] in the case of a continuous-time problem. The result is a nonstationary Markov policy that is optimal for the finite-horizon problem. We fix the initial action (this is the action determined at the Nth step of the value iteration (VI) algorithm) and apply it as a stationary Markov control; we refer to this as the rolling horizon control. This control of course depends on the length of the horizon N. One expects that for well-structured problems, if N is sufficiently large, then the rolling horizon control is near optimal. This is, of course, only a heuristic, and the rolling horizon control might not even be stable. For a good discussion of this problem, we refer the reader to Della Vecchia et al. (2012). Obtaining such solutions is further complicated by the fact that the value of the ergodic cost required in the successive iteration scheme is not known. This is the reason for the RVI scheme.
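To make the two constructions concrete, the following is a minimal sketch for a tabular average-cost MDP, assuming a transition kernel P and one-step cost c given as arrays; the function names, array shapes, and the toy example numbers are illustrative assumptions, not taken from the paper. It shows the relative value iteration recursion, where subtracting the value at a fixed reference state stands in for the unknown ergodic cost, alongside the rolling horizon control obtained by freezing the minimizing action at the Nth step of plain value iteration.

```python
import numpy as np

def relative_value_iteration(P, c, ref_state=0, tol=1e-8, max_iter=10_000):
    """Relative value iteration (RVI) for a finite-state, finite-action
    average-cost MDP.

    P : (A, S, S) array, P[a, s, s_next] = transition probability
    c : (S, A) array,    c[s, a]         = one-step cost
    Returns an estimate of the optimal ergodic cost rho, the relative
    value function h, and a greedy stationary policy.
    """
    A, S, _ = P.shape
    h = np.zeros(S)
    for _ in range(max_iter):
        # One value-iteration sweep: Q[s, a] = c[s, a] + E[h(next state)]
        Q = c + np.einsum("ask,k->sa", P, h)
        h_new = Q.min(axis=1)
        rho = h_new[ref_state]       # subtracted offset; converges to the ergodic cost
        h_new = h_new - rho          # "relative" step keeps the iterates bounded
        if np.max(np.abs(h_new - h)) < tol:
            h = h_new
            break
        h = h_new
    return rho, h, Q.argmin(axis=1)


def rolling_horizon_control(P, c, N):
    """Run N steps of plain value iteration and freeze the minimizing
    action of the final (Nth) step as a stationary Markov control."""
    V = np.zeros(P.shape[1])
    for _ in range(max(N, 1)):
        Q = c + np.einsum("ask,k->sa", P, V)
        V = Q.min(axis=1)
    return Q.argmin(axis=1)


# Tiny two-state, two-action example (numbers are illustrative only).
if __name__ == "__main__":
    P = np.array([[[0.9, 0.1], [0.2, 0.8]],     # action 0
                  [[0.5, 0.5], [0.6, 0.4]]])    # action 1
    c = np.array([[1.0, 2.0],                   # costs of actions 0, 1 at state 0
                  [0.5, 0.3]])                  # costs of actions 0, 1 at state 1
    rho, h, pi = relative_value_iteration(P, c)
    print("ergodic cost:", rho, "RVI policy:", pi)
    print("rolling horizon policy (N=50):", rolling_horizon_control(P, c, 50))
```

Under standard unichain assumptions the two policies coincide for N large enough, which is precisely the near-optimality one hopes for from the rolling horizon heuristic.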