Abstract

An important class of Markov Decision Processes arises when the average reward criterion is chosen as the overall reward function. Under this criterion, the overall reward is defined as the Cesàro limit of the sequence of expected rewards at the successive decision moments. It is well known in Markov Decision Processes that the average reward, viewed as a function on the strategy space, reduces to a linear function on the state-action frequencies, defined as the long-run average frequencies with which the different state-action combinations occur in the infinite decision stream. Optimization with respect to the average reward therefore coincides with optimization of a linear function over the space of state-action frequencies, provided that the "optimal" state-action frequency can be translated back into a strategy. In this paper a procedure is developed that enables this translation of state-action frequencies into strategies. It is shown that, in general, such a strategy is a switching strategy: initially a certain stationary strategy is applied, and at every decision moment a state-dependent lottery is performed whose outcome determines whether to switch to a second stationary strategy, which is then applied forever from that decision moment on.
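The abstract does not state the optimization problem explicitly. As a rough illustration of the standard linear-programming view it refers to, the following sketch (unichain case only; the toy MDP data, variable names, and the use of scipy are illustrative assumptions, not taken from the paper) maximizes the linear average-reward objective over the state-action frequency polytope and recovers a stationary strategy where the state frequencies are positive; states with zero frequency are exactly where a construction such as the paper's switching strategy becomes relevant.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy MDP (not from the paper): S states, A actions,
# P[a, s, s'] transition probabilities, r[s, a] one-step rewards.
S, A = 3, 2
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(A, S))      # each P[a, s, :] sums to 1
r = rng.uniform(0.0, 1.0, size=(S, A))

# Decision variables: x[s, a] = long-run state-action frequencies,
# flattened in row-major order (index = s * A + a).
n = S * A
c = -r.reshape(n)                               # linprog minimizes, so negate

# Balance constraints sum_a x(j,a) - sum_{s,a} p(j|s,a) x(s,a) = 0 for all j,
# plus the normalisation sum_{s,a} x(s,a) = 1 (unichain formulation).
A_eq = np.zeros((S + 1, n))
for j in range(S):
    for s in range(S):
        for a in range(A):
            A_eq[j, s * A + a] = (1.0 if s == j else 0.0) - P[a, s, j]
A_eq[S, :] = 1.0
b_eq = np.zeros(S + 1)
b_eq[S] = 1.0

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * n, method="highs")
x = res.x.reshape(S, A)

# Recover a stationary strategy on states with positive visit frequency:
# pi(a|s) = x(s,a) / sum_a x(s,a); states with zero frequency are left undefined.
state_freq = x.sum(axis=1)
pi = np.where(state_freq[:, None] > 0,
              x / np.maximum(state_freq[:, None], 1e-12), np.nan)
print("optimal average reward:", -res.fun)
print("stationary strategy (rows = states):\n", pi)
```

In the multichain case this simple normalization is no longer sufficient, which is the situation the translation procedure and switching strategies described in the abstract are designed to handle.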
