We consider discounted Markov Decision Processes (MDPs) with large state spaces, aiming to reduce computational complexity and execution time. Existing hierarchical techniques typically decompose the state space into strongly connected components (SCCs) organized across levels, but they overlook the size of the SCCs at each level, which significantly affects efficiency. We propose the Parallel Hierarchical Value Iteration (PHVI) algorithm, which handles large MDPs efficiently by taking SCC size into account when distributing work across threads, improving computational performance and reducing execution time. Experimental results demonstrate the effectiveness of PHVI and its superiority over traditional methods in solving complex MDPs.
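The abstract does not detail PHVI itself; the sketch below only illustrates the general idea it describes: run value iteration SCC by SCC in reverse topological order, solving independent SCCs of the same level in parallel and scheduling larger SCCs first. It assumes a tabular MDP stored as `transitions[s][a] = [(prob, next_state, reward), ...]`, a discount factor `GAMMA`, and the `networkx` library for the SCC condensation; the level grouping, largest-first ordering, and thread pool are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of SCC-level parallel value iteration (not the paper's PHVI).
# Assumes every reachable state appears as a key of `transitions`.
from concurrent.futures import ThreadPoolExecutor
import networkx as nx

GAMMA = 0.95   # discount factor (assumed)
TOL = 1e-6     # convergence threshold (assumed)

def value_iteration_on_scc(scc, transitions, V):
    """Run value iteration restricted to one SCC.

    States outside the SCC only appear as successors, and their values in V
    are already final because downstream SCCs were solved earlier.
    """
    while True:
        delta = 0.0
        for s in scc:
            best = max(
                sum(p * (r + GAMMA * V[s2]) for p, s2, r in outcomes)
                for outcomes in transitions[s].values()
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < TOL:
            break

def scc_parallel_value_iteration(transitions, n_threads=4):
    # Build the state-transition graph and its SCC condensation (a DAG).
    G = nx.DiGraph()
    for s, actions in transitions.items():
        G.add_node(s)
        for outcomes in actions.values():
            for _, s2, _ in outcomes:
                G.add_edge(s, s2)
    cond = nx.condensation(G)                       # nodes are SCC indices
    members = nx.get_node_attributes(cond, "members")

    # Level of an SCC = 1 + max level of its successors (sinks get level 0),
    # so solving levels in increasing order respects all dependencies.
    level = {}
    for n in reversed(list(nx.topological_sort(cond))):
        level[n] = 1 + max((level[m] for m in cond.successors(n)), default=-1)

    V = {s: 0.0 for s in transitions}
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        for lvl in range(max(level.values()) + 1):
            # SCCs on the same level are independent; submit the largest
            # first so big components do not become stragglers.
            sccs = sorted((members[n] for n in cond if level[n] == lvl),
                          key=len, reverse=True)
            futures = [pool.submit(value_iteration_on_scc, scc, transitions, V)
                       for scc in sccs]
            for f in futures:
                f.result()
    return V
```

Note that in CPython the thread pool mainly illustrates the scheduling; the GIL limits true parallelism for pure-Python loops, whereas the paper targets genuinely parallel execution.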