Abstract

Certain methods for constructing embedded Markov decision processes (MDPs) lead to performance measures that are the ratio of two long-run averages. For such MDPs with finite state and action spaces and under an ergodicity assumption, this note presents algorithms for computing optimal policies based on policy iteration, linear programming, value iteration, and Q-learning.
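For concreteness, here is a minimal sketch, not taken from the note itself, of policy iteration for this ratio-of-averages criterion in Python/NumPy. It assumes finite state and action spaces, strictly positive denominator quantities, and an ergodic chain under every policy; the array layout (`P[a, s, s']`, `R[s, a]`, `T[s, a]`), the function name, and the random test instance are all illustrative assumptions.

```python
import numpy as np

def ratio_policy_iteration(P, R, T, tol=1e-9, max_iter=1000):
    """Policy iteration for a finite MDP whose performance measure is the
    ratio g(pi) = (long-run average of R) / (long-run average of T).

    P[a, s, s2] -- transition probabilities under action a (illustrative layout)
    R[s, a]     -- one-step numerator rewards
    T[s, a]     -- one-step denominator quantities (strictly positive)
    """
    n_states, n_actions = R.shape
    idx = np.arange(n_states)
    pi = np.zeros(n_states, dtype=int)          # arbitrary initial policy
    for _ in range(max_iter):
        Ppi, rpi, tpi = P[pi, idx, :], R[idx, pi], T[idx, pi]
        # Evaluation: solve h = r - g*t + Ppi h with h[0] pinned to 0.
        # Replacing column 0 of (I - Ppi) by t makes (g, h[1:]) the unknowns
        # of a single nonsingular linear system (ergodicity guarantees this).
        A = np.eye(n_states) - Ppi
        A[:, 0] = tpi
        x = np.linalg.solve(A, rpi)
        g, h = x[0], x.copy()
        h[0] = 0.0
        # Improvement on the g-adjusted one-step values.
        Q = R - g * T + (P @ h).T               # Q[s, a]
        best = Q.max(axis=1)
        # Keep the current action on ties to avoid cycling.
        new_pi = np.where(Q[idx, pi] >= best - tol, pi, Q.argmax(axis=1))
        if np.array_equal(new_pi, pi):
            return pi, g
        pi = new_pi
    return pi, g

# Tiny usage example on a randomly generated ergodic MDP.
rng = np.random.default_rng(0)
nS, nA = 4, 3
P = rng.dirichlet(np.ones(nS), size=(nA, nS))   # fully mixing transition rows
R = rng.uniform(0.0, 1.0, size=(nS, nA))
T = rng.uniform(0.5, 2.0, size=(nS, nA))        # positive denominators
pi, g = ratio_policy_iteration(P, R, T)
print("optimal policy:", pi, " optimal ratio g:", float(g))
```

With T identically one this reduces to standard average-reward policy iteration; for general positive T it is the classical semi-Markov-style evaluation-improvement loop, here serving only to illustrate the kind of algorithm the note develops.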
