Abstract

An important class of Markov Decision Processes arises when the average reward criterion is chosen as the overall reward function. Under this criterion, the overall reward is defined as the Cesàro limit of the sequence of expected rewards at the successive decision moments. It is well known in Markov Decision Processes that the average reward, viewed as a function on the strategy space, reduces to a linear function on the state-action frequencies, defined as the long-run average frequencies with which the different state-action combinations occur in the infinite decision stream. Optimization with respect to the average reward therefore coincides with optimization of a linear function over the space of state-action frequencies, provided that the "optimal" state-action frequency can be translated back into a strategy. In this paper a procedure is developed that enables this translation of state-action frequencies into strategies. It is shown that, in general, such a strategy is a switching strategy: initially a certain stationary strategy is applied, and at every decision moment a state-dependent lottery is performed whose outcome determines whether to switch to a second stationary strategy, which is then applied forever from that decision moment on.
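The abstract does not state the optimization problem explicitly. As a rough illustration of the standard linear-programming view it refers to, the following sketch (unichain case only; the toy MDP data, variable names, and the use of scipy are illustrative assumptions, not taken from the paper) maximizes the linear average-reward objective over the state-action frequency polytope and recovers a stationary strategy where the state frequencies are positive; states with zero frequency are exactly where a construction such as the paper's switching strategy becomes relevant.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy MDP (not from the paper): S states, A actions,
# P[a, s, s'] transition probabilities, r[s, a] one-step rewards.
S, A = 3, 2
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(A, S))      # each P[a, s, :] sums to 1
r = rng.uniform(0.0, 1.0, size=(S, A))

# Decision variables: x[s, a] = long-run state-action frequencies,
# flattened in row-major order (index = s * A + a).
n = S * A
c = -r.reshape(n)                               # linprog minimizes, so negate

# Balance constraints sum_a x(j,a) - sum_{s,a} p(j|s,a) x(s,a) = 0 for all j,
# plus the normalisation sum_{s,a} x(s,a) = 1 (unichain formulation).
A_eq = np.zeros((S + 1, n))
for j in range(S):
    for s in range(S):
        for a in range(A):
            A_eq[j, s * A + a] = (1.0 if s == j else 0.0) - P[a, s, j]
A_eq[S, :] = 1.0
b_eq = np.zeros(S + 1)
b_eq[S] = 1.0

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * n, method="highs")
x = res.x.reshape(S, A)

# Recover a stationary strategy on states with positive visit frequency:
# pi(a|s) = x(s,a) / sum_a x(s,a); states with zero frequency are left undefined.
state_freq = x.sum(axis=1)
pi = np.where(state_freq[:, None] > 0,
              x / np.maximum(state_freq[:, None], 1e-12), np.nan)
print("optimal average reward:", -res.fun)
print("stationary strategy (rows = states):\n", pi)
```

In the multichain case this simple normalization is no longer sufficient, which is the situation the translation procedure and switching strategies described in the abstract are designed to handle.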
