Abstract

This paper considers the variance optimization problem for the average reward in continuous-time Markov decision processes (MDPs). The state space is assumed to be countable and the action space to be a Borel measurable space. The main purpose of this paper is to find the policy with minimal variance within the class of deterministic stationary policies. Unlike in the traditional Markov decision process, the cost function under the variance criterion is affected by future actions. To address this, we convert the variance minimization problem into a standard MDP by introducing a concept called pseudo-variance. Further, by developing a policy iteration algorithm for the pseudo-variance optimization problem, we derive the optimal policy of the original variance optimization problem and give a sufficient condition for variance optimality. Finally, we use an example to illustrate the conclusions of this paper.
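A minimal sketch of the pseudo-variance idea, under the assumption (not taken from the text available here, and possibly differing from the paper's exact definition) that the reference mean is frozen at a fixed constant \lambda rather than at the policy-dependent average reward, so that the running cost no longer depends on future actions:

\[
\tilde{\sigma}^{2}_{\lambda}(f) \;=\; \limsup_{T \to \infty} \frac{1}{T}\, \mathbb{E}^{f}_{i}\!\left[\int_{0}^{T} \bigl(r\bigl(x(t), f(x(t))\bigr) - \lambda\bigr)^{2}\, dt\right].
\]

With \lambda fixed, this is the long-run average cost of a standard continuous-time MDP with running cost (r(i,a) - \lambda)^{2}, so standard policy iteration applies; and if \lambda equals the average reward of the policy under consideration, the pseudo-variance reduces to that policy's variance.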

Highlights

  • The Postal Service Company’s catalogue information system, inventory issues, and supply chain management issues are all early successful applications of the Markov decision process

  • This paper considers the variance optimization problem of average reward in continuous-time Markov decision processes (MDPs)

  • It is assumed that the state space is countable and the action space is a Borel measurable space

Summary

Introduction

The Postal Service Company’s catalogue information system, inventory problems, and supply chain management problems are all early successful applications of the Markov decision process. This paper considers the variance optimization problem for the average reward in continuous-time Markov decision processes. For the continuous-time case, the variance of the average expected return has been defined over deterministic stationary policies; the variance function of the average expected return used in this paper is of the corresponding continuous-time form (a standard version is sketched below). The main contribution of this paper is an iterative algorithm for the optimal policy under the variance criterion (minimum variance) on a countable state space with a Borel measurable action space. We use the unique invariant probability measure of the Markov chain to express the average expected return and the variance. To this end, we impose Assumptions 2, 3, and 4. We develop a policy iteration algorithm for continuous-time MDPs to obtain the optimal policy, and we prove the convergence of the policy iteration algorithm.
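For concreteness, here is a minimal sketch of how the invariant probability measure enters, in the standard ergodic-average form; the symbols \mu_f, r, \eta, and \sigma^2 below are notational assumptions rather than quotations from the paper. Under a deterministic stationary policy f with unique invariant probability measure \mu_f,

\[
\eta(f) \;=\; \sum_{i \in S} \mu_f(i)\, r\bigl(i, f(i)\bigr), \qquad
\sigma^{2}(f) \;=\; \sum_{i \in S} \mu_f(i)\, \bigl(r\bigl(i, f(i)\bigr) - \eta(f)\bigr)^{2},
\]

and the variance criterion asks for a deterministic stationary policy f^{*} with \sigma^{2}(f^{*}) \le \sigma^{2}(f) for all such f.

The following is a hedged sketch of what one pass of policy iteration for the pseudo-variance problem could look like on a toy finite model. The rate matrices, the reference value lam, and the helper names (policy_cost, evaluate, improve) are illustrative assumptions, not the paper's algorithm or data.

import numpy as np

# Hedged sketch: policy iteration for an average-cost continuous-time MDP on a
# small finite state/action set.  The running cost is the pseudo-variance cost
# c(i, a) = (r(i, a) - lam)**2 with a fixed reference value lam; all numerical
# data below are illustrative assumptions.

S, A = 3, 2                                   # states and actions (toy sizes)
rng = np.random.default_rng(0)

r = rng.uniform(0.0, 1.0, size=(S, A))        # reward rates r(i, a)

# Transition-rate arrays q[i, a, :]: nonnegative off-diagonal, rows sum to 0.
q = rng.uniform(0.1, 1.0, size=(S, A, S))
for i in range(S):
    for a in range(A):
        q[i, a, i] = 0.0
        q[i, a, i] = -q[i, a, :].sum()

def policy_cost(lam):
    """Pseudo-variance running cost (r(i, a) - lam)^2 for every pair (i, a)."""
    return (r - lam) ** 2

def evaluate(f, c):
    """Solve the Poisson equation  c(i, f(i)) + sum_j q(j|i, f(i)) h(j) = g
    with the normalization h(0) = 0; returns (gain g, bias vector h)."""
    M = np.zeros((S, S))
    rhs = np.zeros(S)
    for i in range(S):
        M[i, :S - 1] = q[i, f[i], 1:]         # coefficients of h(1), ..., h(S-1)
        M[i, S - 1] = -1.0                    # coefficient of the gain g
        rhs[i] = -c[i, f[i]]
    x = np.linalg.solve(M, rhs)
    h = np.concatenate(([0.0], x[:S - 1]))
    return x[S - 1], h

def improve(h, c):
    """Greedy improvement: minimize c(i, a) + sum_j q(j|i, a) h(j) over a."""
    return np.array([np.argmin([c[i, a] + q[i, a] @ h for a in range(A)])
                     for i in range(S)])

# Policy iteration on the pseudo-variance MDP for a fixed reference value lam.
lam = 0.5                                     # illustrative reference value
c = policy_cost(lam)
f = np.zeros(S, dtype=int)                    # initial deterministic policy
while True:
    g, h = evaluate(f, c)
    f_new = improve(h, c)
    if np.array_equal(f_new, f):
        break
    f = f_new

print("pseudo-variance-optimal gain:", g, "policy:", f)

Each iteration solves an evaluation (Poisson) equation for the gain and bias of the current policy and then improves greedily; for unichain finite models this terminates after finitely many iterations, which mirrors the kind of convergence result the paper establishes in its more general countable-state, Borel-action setting.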

Model and Optimization Criteria
Analysis and Optimization
Pseudo-Variance Minimization
Variance Minimization
Examples
Discussion and Conclusion