Abstract Within the realm of Discrete Event Systems (DES) theory, the performance optimization problem for many applications can be modeled as an infinite-horizon, average-reward Markov Decision Process (MDP) with a finite state space. In principle, these MDPs can be solved by well-developed methods such as value iteration, policy iteration, and linear programming. In practice, however, the tractability of these methods for the aforementioned applications is compromised by the explosive size of the underlying state spaces, a problem known as “the curse of dimensionality”. Hence, the corresponding performance optimization problems are frequently addressed by heuristic control policies. The present work uses results from (i) the sensitivity analysis of Markov reward processes and (ii) ranking-and-selection theory in statistics in order to develop a methodology for assessing the optimality of isolated decisions in the context of any well-defined heuristic control policy for the aforementioned MDPs. The methodology also determines an improved decision when the current one is found to be suboptimal. Hence, when embedded in an iterative scheme, it can support the incremental enhancement of the original heuristic policy in a way that controls both the computational and the representational complexity of the new policy. An additional important feature of the presented methodology is that it can be executed either in an “off-line” mode, using a simulation of the dynamics of the underlying DES, or in an “on-line” mode, based on the sample path defined by the real-time dynamics of the controlled system.
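To make the general idea concrete, the following is a minimal, hypothetical Python sketch of the kind of decision assessment described above: for a small average-reward MDP, the action prescribed by a heuristic policy at a single state is compared against the alternative actions by simulating the correspondingly perturbed policies and applying a simple mean-comparison rule in the spirit of ranking & selection. All names, the toy model, and the brute-force comparison are illustrative assumptions and stand in for the paper's sensitivity-based estimators; they are not the authors' actual algorithm.

```python
# Hypothetical sketch: assessing one decision of a heuristic policy in a toy
# average-reward MDP via simulated sample paths (names and model are illustrative).
import numpy as np

rng = np.random.default_rng(0)

# Toy MDP: transition tensor P[a, s, s'] and one-step rewards r[a, s].
n_states, n_actions = 5, 3
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
r = rng.uniform(0.0, 1.0, size=(n_actions, n_states))

heuristic_policy = rng.integers(0, n_actions, size=n_states)  # some given heuristic

def simulate_average_reward(policy, s0=0, horizon=20_000):
    """Estimate the long-run average reward of a stationary policy from one sample path."""
    s, total = s0, 0.0
    for _ in range(horizon):
        a = policy[s]
        total += r[a, s]
        s = rng.choice(n_states, p=P[a, s])
    return total / horizon

def assess_decision(policy, state, n_reps=5):
    """Compare the current action at `state` against each alternative by simulating
    the perturbed policy; return the apparently best action and the sample statistics."""
    stats = {}
    for a in range(n_actions):
        perturbed = policy.copy()
        perturbed[state] = a
        gains = np.array([simulate_average_reward(perturbed) for _ in range(n_reps)])
        stats[a] = (gains.mean(), gains.std(ddof=1) / np.sqrt(n_reps))
    best = max(stats, key=lambda a: stats[a][0])
    return best, stats

best_action, stats = assess_decision(heuristic_policy, state=2)
current = heuristic_policy[2]
print(f"current action {current}: mean={stats[current][0]:.4f} +/- {stats[current][1]:.4f}")
print(f"best action    {best_action}: mean={stats[best_action][0]:.4f} +/- {stats[best_action][1]:.4f}")
```

In this sketch each candidate action is evaluated by re-simulating the entire perturbed policy, which is the crudest possible estimator; the methodology summarized in the abstract instead draws on the sensitivity analysis of Markov reward processes to assess the isolated decision from sample-path information, and on ranking-and-selection procedures to control the statistical error of the comparison.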