Abstract

Reinforcement learning (RL) in Markov Decision Processes is studied with an emphasis on the well-studied exploration problem. We first formulate and discuss a definition of “efficient” algorithms, termed Probably Approximately Correct (PAC) in RL. Next, we provide general sufficient conditions for such an algorithm that apply under several different modeling assumptions. The conditions can be used to demonstrate that efficient learning is possible in finite MDPs, with either a model-based or model-free approach, in factored MDPs, and in continuous MDPs with linear dynamics.

In the reinforcement-learning (RL) problem (Sutton & Barto 1998), an agent acts in an unknown or incompletely known environment with the goal of maximizing an external reward signal. In the most standard mathematical formulation of the problem, the environment is modeled as a Markov Decision Process (MDP) and the goal of the agent is to obtain near-optimal discounted return. Over the years, many algorithms have been proposed for this problem, but analyses of their performance have been relatively scarce. In fact, until recently, most theoretical guarantees have been that certain algorithms discover an optimal policy in the limit, after an infinite amount of experience. In contrast, several attempts have been made to study “Probably Approximately Correct” or PAC-MDP algorithms, which exhibit near-optimal behavior in polynomial time and experience. This paper discusses several extensions of those results. We present a theorem that provides sufficient conditions for an algorithm to be PAC-MDP. We examine these conditions and show how they can be applied to prove that efficient learning is possible in three interesting scenarios: finite MDPs (i.e., the “tabular” case), factored MDPs, and continuous MDPs with linear dynamics.
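For concreteness, the notion of efficiency referred to above is usually stated in terms of the sample complexity of exploration; the following is a sketch of that standard criterion, and the paper's precise definition may differ in its details. Let A_t denote the algorithm's (non-stationary) policy at time t, s_t the state visited at time t, V^{A_t} its value function, and V^* the optimal value function. An algorithm is PAC-MDP if, for any ε > 0 and δ ∈ (0, 1), with probability at least 1 − δ,

\[
\bigl|\{\, t : V^{A_t}(s_t) < V^*(s_t) - \epsilon \,\}\bigr| \;\le\; \mathrm{poly}\!\Bigl(|S|,\ |A|,\ \tfrac{1}{\epsilon},\ \tfrac{1}{\delta},\ \tfrac{1}{1-\gamma}\Bigr),
\]

i.e., the number of timesteps on which the algorithm acts more than ε-suboptimally is bounded by a polynomial in the relevant problem quantities (for finite MDPs, typically |S|, |A|, 1/ε, 1/δ, and 1/(1 − γ)).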

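To make the finite (tabular), model-based case concrete, the sketch below shows an R-max-style learner of the kind such sufficient conditions are typically applied to: state–action pairs with fewer than m samples are treated optimistically, and the agent acts greedily with respect to a plan computed on the resulting optimistic model. This is an illustrative sketch under assumed parameter names (m, gamma, r_max) and a hypothetical class RMaxAgent, not the specific algorithm analyzed in the paper.

import numpy as np

class RMaxAgent:
    """Illustrative R-max-style tabular, model-based learner (a sketch).
    Unknown state-action pairs (fewer than m samples) are modeled optimistically
    as worth R_max forever, which is what drives systematic exploration."""

    def __init__(self, n_states, n_actions, gamma=0.95, m=20, r_max=1.0):
        self.nS, self.nA = n_states, n_actions
        self.gamma, self.m, self.r_max = gamma, m, r_max
        self.counts = np.zeros((n_states, n_actions))                 # visit counts n(s, a)
        self.trans_counts = np.zeros((n_states, n_actions, n_states)) # empirical transitions
        self.reward_sums = np.zeros((n_states, n_actions))            # empirical rewards
        self.Q = np.full((n_states, n_actions), r_max / (1 - gamma))  # optimistic initialization

    def act(self, s):
        # Greedy action with respect to the current (optimistic) value estimates.
        return int(np.argmax(self.Q[s]))

    def update(self, s, a, r, s_next):
        # Collect experience only until (s, a) becomes "known"; then replan once.
        if self.counts[s, a] < self.m:
            self.counts[s, a] += 1
            self.trans_counts[s, a, s_next] += 1
            self.reward_sums[s, a] += r
            if self.counts[s, a] == self.m:
                self._plan()

    def _plan(self, n_iters=200, tol=1e-6):
        # Value iteration on the optimistic empirical model: known (s, a) pairs use
        # empirical estimates; unknown pairs keep the optimistic value R_max / (1 - gamma).
        known = self.counts >= self.m
        P = np.where(known[..., None],
                     self.trans_counts / np.maximum(self.counts[..., None], 1), 0.0)
        R = np.where(known, self.reward_sums / np.maximum(self.counts, 1), 0.0)
        Q = self.Q.copy()
        for _ in range(n_iters):
            V = Q.max(axis=1)
            Q_new = np.where(known,
                             R + self.gamma * (P @ V),
                             self.r_max / (1 - self.gamma))
            if np.max(np.abs(Q_new - Q)) < tol:
                Q = Q_new
                break
            Q = Q_new
        self.Q = Q

A typical interaction loop calls agent.act(s), executes the action in the environment, and then calls agent.update(s, a, r, s_next); PAC-MDP guarantees of the kind discussed above bound, with high probability, the number of timesteps such an agent can spend acting more than ε-suboptimally.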