Policy Gradient for Continuing Tasks in Discounted Markov Decision Processes

Santiago Paternain,Alejandro Ribeiro,Juan Andres Bazerque

doi:10.1109/tac.2022.3163085

Abstract

Reinforcement learning aims to find policies that maximize an expected cumulative reward in Markov decision processes with unknown transition probabilities. Policy gradient (PG)-algorithms use stochastic gradients of the value function to update the policy. A major drawback of PG-algorithms is that they are limited to episodic tasks (multiple finite-horizon trajectories) unless stringent stationarity assumptions are imposed on the trajectories. Hence, they need restarts and cannot be fully implemented online, which is critical for systems need to adapt to new tasks and/or environments in deployment. Moreover, the standard stationary formulation ignores transient behaviors. This motivates our study of discounted MDPs of infinite horizon without restarts. However, it is unknown if in this case following stochastic PG-type estimates would improve the policy. The main result of this work is to establish that when policies belong to a reproducing kernel Hilbert space (RKHS), and the kernel is selected properly, then these PG-estimates are ascent directions for the value function conditioned to any arbitrary initial point. This allows us to prove convergence of our online algorithm to the local optima. A numerical example shows that an agent running our online algorithm learns to navigate and succeeds in a surveillance task that requires looping between two goal locations. This example corroborates our theoretical findings about the ascent directions of subsequent stochastic gradients. It also shows how our online algorithm guides the agent through a continuing cyclic trajectory that does not comply with the standard stationarity assumptions in the literature for non-episodic training.

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

R Discovery Prime

R Discovery Prime

Policy Gradient for Continuing Tasks in Discounted Markov Decision Processes

Abstract

Talk to us

Similar Papers

More From: IEEE Transactions on Automatic Control

Lead the way for us

Editage

Paperpal

R Discovery

Mind the Graph

R Discovery Prime

R Discovery Prime

Policy Gradient for Continuing Tasks in Discounted Markov Decision Processes

Abstract

Talk to us

Similar Papers

More From: IEEE Transactions on Automatic Control