Discounted Cost Markov Decision Processes Research Articles

Approximate linear programs (ALPs) are well-known models based on value function approximations (VFAs) to obtain policies and lower bounds on the optimal policy cost of discounted-cost Markov decision processes (MDPs). Formulating an ALP requires (i) basis functions, the linear combination of which defines the VFA, and (ii) a state-relevance distribution, which determines the relative importance of different states in the ALP objective for the purpose of minimizing VFA error. Both of these choices are typically heuristic: basis function selection relies on domain knowledge, whereas the state-relevance distribution is specified using the frequency of states visited by a baseline policy. We propose a self-guided sequence of ALPs that embeds random basis functions obtained via inexpensive sampling and uses the known VFA from the previous iteration to guide VFA computation in the current iteration. In other words, this sequence takes multiple shots at randomly approximating the MDP value function with VFA-based guidance between consecutive approximation attempts. Self-guided ALPs reduce the need for domain knowledge during basis function selection and lessen the impact of the state-relevance-distribution choice, thus reducing the ALP implementation burden. We establish high-probability error bounds on the VFAs from this sequence and show that a worst-case measure of policy performance is improved. We find that these favorable implementation and theoretical properties translate to encouraging numerical results on perishable inventory control and options pricing applications, where self-guided ALP policies improve upon policies from problem-specific methods. More broadly, our research takes a meaningful step toward application-agnostic policies and bounds for MDPs. This paper was accepted by Chung Piaw Teo, optimization. Supplemental Material: The online appendix and data files are available at https://doi.org/10.1287/mnsc.2020.00038.
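
For readers unfamiliar with ALPs, the two ingredients named in the abstract (basis functions and a state-relevance distribution) enter the linear program roughly as follows. This is the standard textbook form for a discounted-cost MDP, not the self-guided variant proposed in the paper. The symbols are assumptions introduced here for illustration: state space S, feasible actions A_s, one-stage cost c(s,a), discount factor γ in (0,1), basis functions φ_b, weights β, and state-relevance distribution ν.

```latex
\begin{align*}
\max_{\beta}\quad & \sum_{s \in S} \nu(s)\, V(s;\beta) \\
\text{s.t.}\quad  & V(s;\beta) \;\le\; c(s,a) + \gamma\, \mathbb{E}\!\left[ V(s';\beta) \,\middle|\, s, a \right]
                    \qquad \forall\, s \in S,\ a \in A_s, \\
\text{where}\quad & V(s;\beta) \;=\; \beta_0 + \sum_{b=1}^{B} \beta_b\, \varphi_b(s).
\end{align*}
```

Any VFA that satisfies the constraints is a pointwise lower bound on the optimal cost function, so maximizing the ν-weighted objective is equivalent to minimizing the ν-weighted distance between the VFA and the optimal cost function over the feasible set. This is why the choice of ν affects which states the approximation fits best.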

We are motivated by the need, in emergency situations, for impromptu (or “as-you-go”) deployment of multihop wireless networks by human agents or robots (e.g., unmanned aerial vehicles (UAVs)); the agent moves along a line, makes wireless link quality measurements at regular intervals, and makes on-line placement decisions using these measurements. As a first step, we have formulated such deployment along a line as a sequential decision problem. In our earlier work, reported in [1], we proposed two possible deployment approaches: (i) the pure as-you-go approach, where the deployment agent can only move forward, and (ii) the explore-forward approach, where the deployment agent explores a few successive steps and then selects the best relay placement location among them. The latter was shown to provide better performance (in terms of network cost, network performance, and power expenditure), but at the expense of more measurements and deployment time, which makes explore-forward impractical for quick deployment by an energy-constrained agent such as a UAV. Further, since in emergency situations the terrain would be unknown, the deployment algorithm should not require a priori knowledge of the parameters of the wireless propagation model. In [1], we therefore developed learning algorithms for the explore-forward approach. The current paper fills an important gap by providing deploy-and-learn algorithms for the pure as-you-go approach. We formulate the sequential relay deployment problem as an average cost Markov decision process (MDP), which trades off among power consumption, link outage probabilities, and the number of relay nodes in the deployed network. While the pure as-you-go deployment problem was previously formulated as a discounted cost MDP (see [1]), the discounted cost MDP formulation was not amenable to the learning algorithms proposed in this paper.
In this paper, we first show structural results for the optimal policy corresponding to the average cost MDP and provide new insights into the optimal policy. Next, by exploiting the special structure of the average cost optimality equation and by using the theory of asynchronous stochastic approximation (in single and two timescales), we develop two learning algorithms that asymptotically converge to the set of optimal policies as deployment progresses. Numerical results show reasonably fast convergence, and hence the model-free algorithms can be useful for practical, fast deployment of emergency wireless networks.
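
To make the "average cost optimality equation" mentioned above concrete, here is a minimal Python sketch of relative value iteration on a toy two-state MDP. It is a generic textbook method that assumes a known transition model, so it is not the paper's model-free learning algorithm. The model (`COSTS`, `TRANS`) and all function names are hypothetical, chosen only so the fixed point can be checked by hand.

```python
from typing import List, Tuple

# Hypothetical toy model (not from the paper): 2 states, 2 actions.
# Action 0 = stay in the current state, action 1 = switch to the other state.
COSTS = [[1.0, 2.0],   # state 0: stay costs 1, switch costs 2
         [3.0, 2.0]]   # state 1: stay costs 3, switch costs 2
TRANS = [
    [[1.0, 0.0], [0.0, 1.0]],  # from state 0: stay -> 0, switch -> 1
    [[0.0, 1.0], [1.0, 0.0]],  # from state 1: stay -> 1, switch -> 0
]


def q_values(h: List[float], s: int) -> List[float]:
    """One-step cost plus expected relative value, for each action in state s."""
    return [
        COSTS[s][a] + sum(p * h[t] for t, p in enumerate(TRANS[s][a]))
        for a in range(len(COSTS[s]))
    ]


def relative_value_iteration(
    ref: int = 0, tol: float = 1e-9, max_iter: int = 10_000
) -> Tuple[float, List[float], List[int]]:
    """Iterate h <- T(h) - T(h)[ref] until h stops changing.

    At a fixed point, gain + h(s) = min_a [c(s,a) + sum_t P(t|s,a) h(t)],
    which is the average cost optimality equation with h(ref) = 0.
    """
    n = len(COSTS)
    h = [0.0] * n
    gain = 0.0
    for _ in range(max_iter):
        backup = [min(q_values(h, s)) for s in range(n)]
        gain = backup[ref]                    # current estimate of average cost
        new_h = [b - gain for b in backup]    # normalize so h(ref) = 0
        converged = max(abs(x - y) for x, y in zip(new_h, h)) < tol
        h = new_h
        if converged:
            break
    # Greedy policy with respect to the final relative values.
    policy = []
    for s in range(n):
        q = q_values(h, s)
        policy.append(q.index(min(q)))
    return gain, h, policy


gain, h, policy = relative_value_iteration()
print(gain, h, policy)
```

In this toy model the cheapest long-run behavior is to stay in state 0, and to switch into it from state 1. The learning algorithms described in the abstract replace the known-model expectation used here with updates driven by measurements taken during deployment.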

Articles published on Discounted Cost Markov Decision Processes

Self-Guided Approximate Linear Programs: Randomized Multi-Shot Approximation of Discounted Cost Markov Decision Processes

Optimal admission and queuing control with reneging behavior under premature discharge decisions

Optimal admission control under premature discharge decisions for operational effectiveness

Non-Asymptotic Analysis of Monte Carlo Tree Search

Asynchronous Stochastic Approximation Based Learning Algorithms for As-You-Go Deployment of Wireless Relay Networks Along a Line

Actor-Critic Algorithms with Online Feature Adaptation

Dynamic Energy Storage Control for Reducing Electricity Cost in Data Centers

Q-Learning Based Energy Management Policies for a Single Sensor Node with Finite Buffer

Reducing Electricity Cost of Smart Appliances via Energy Buffering Framework in Smart Grid

Optimizing contracted resource capacity with two advance cancelation modes

Markov decision processes with exponentially representable discounting

Discounted Cost Markov Decision Processes with a Constraint

Discounted Cost Markov Decision Processes on Borel Spaces: The Linear Programming Formulation

Suboptimal policy determination for large-scale Markov decision processes, Part 1: Description and bounds
