Abstract

Research indicates that perturbation analysis (PA), Markov decision processes (MDPs), and reinforcement learning (RL) are three closely related areas in the optimization of discrete event dynamic systems. In particular, it has been shown that policy iteration in fact chooses, for the next iteration, the policy with the steepest performance gradient (as provided by PA). This sensitivity-based view of MDPs leads to several new research topics. We propose implementing policy iteration based on performance gradients. The approach is particularly useful when the actions at different states are correlated, so that standard policy iteration cannot be applied. We illustrate the main ideas with an M/G/1/N queue example and identify some topics for further research.
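
As a rough illustration of the claim above, the sketch below runs average-reward policy iteration on a small finite MDP, where the improvement step picks, at every state, the action with the largest value of r(i,a) + sum_j p(j|i,a) g(j) computed from the performance potentials g of the current policy; by the performance-difference formula this is the direction of steepest performance gradient. This is a minimal sketch under assumed conventions: the transition matrices, reward vectors, and helper names (stationary_dist, potentials, policy_iteration) are illustrative assumptions, not code from the paper, and the M/G/1/N example of the paper is not reproduced here.

```python
# Minimal sketch (illustrative, not from the paper): average-reward policy
# iteration whose improvement step follows the steepest performance gradient
# given by the potentials of the current policy.
import numpy as np

def stationary_dist(P):
    """Stationary distribution pi of an ergodic transition matrix P (pi P = pi, pi e = 1)."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    return np.linalg.lstsq(A, b, rcond=None)[0]

def potentials(P, r):
    """Performance potentials g solving (I - P + e pi^T) g = r - eta e."""
    n = P.shape[0]
    pi = stationary_dist(P)
    eta = pi @ r                      # long-run average reward of the current policy
    g = np.linalg.solve(np.eye(n) - P + np.outer(np.ones(n), pi), r - eta)
    return g, eta

def policy_iteration(P_a, r_a, n_iter=100):
    """P_a[a] is the transition matrix and r_a[a] the reward vector under action a.
    Each state switches to an action with a larger value of r(i,a) + sum_j p(j|i,a) g(j),
    i.e. moves in the steepest-ascent direction given by the current potentials."""
    n_states = P_a[0].shape[0]
    policy = np.zeros(n_states, dtype=int)
    eta = 0.0
    for _ in range(n_iter):
        P = np.array([P_a[policy[i]][i] for i in range(n_states)])
        r = np.array([r_a[policy[i]][i] for i in range(n_states)])
        g, eta = potentials(P, r)
        scores = np.array([r_a[a] + P_a[a] @ g for a in range(len(P_a))])
        # keep the current action unless another is strictly better (avoids cycling on ties)
        new_policy = policy.copy()
        for i in range(n_states):
            best = int(scores[:, i].argmax())
            if scores[best, i] > scores[policy[i], i] + 1e-10:
                new_policy[i] = best
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    return policy, eta

if __name__ == "__main__":
    # Hypothetical 2-state, 2-action MDP used only to exercise the sketch.
    P_a = [np.array([[0.9, 0.1], [0.2, 0.8]]),
           np.array([[0.5, 0.5], [0.6, 0.4]])]
    r_a = [np.array([1.0, 0.0]), np.array([0.5, 2.0])]
    print(policy_iteration(P_a, r_a))
```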
