Abstract

Value functions are an essential tool for solving sequential decision-making problems such as Markov decision processes (MDPs). Computing the value function of a given policy (policy evaluation) is not only important for assessing the quality of that policy but also a key step in prominent policy-iteration-type algorithms. In common settings where a model of the Markov decision process is unavailable or too complex to handle directly, an approximation of the value function is usually estimated from samples of the process. Linearly parameterized estimates are often preferred due to their simplicity and strong stability guarantees. Since the late 1980s, research on policy evaluation in these scenarios has been dominated by temporal-difference (TD) methods because of their data efficiency. However, several core issues have only been tackled recently, including stability guarantees for off-policy estimation, where the samples are not generated by the policy to be evaluated. Together with improved sample efficiency and probabilistic treatment of uncertainty in the value estimates, these efforts have led to numerous new temporal-difference algorithms. These methods are scattered across the literature and are usually compared only to the most similar approaches. This article therefore aims to present the state of the art of policy evaluation with temporal differences and linearly parameterized value functions in discounted MDPs, as well as a more comprehensive comparison of these approaches. We put the algorithms into a unified framework of function optimization, with a focus on surrogate cost functions and optimization strategies, to identify similarities and differences between the methods. In addition, we discuss important extensions of the base methods, such as off-policy estimation, eligibility traces for a better bias-variance trade-off, and regularization in high-dimensional feature spaces.
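To make the setting concrete, the following is a minimal sketch (not taken from the article) of the simplest member of the surveyed family, semi-gradient TD(0) with a linear value estimate V(s) ≈ θᵀφ(s), run on the classic five-state random-walk chain. The environment, the one-hot feature map, and the step-size choice are illustrative assumptions, not the article's setup; the update rule itself is the standard TD(0) rule.

```python
import numpy as np

rng = np.random.default_rng(0)

# Five-state random-walk Markov reward process (the policy is already
# folded in): from each state move left or right with equal probability;
# entering the right terminal state yields reward 1, everything else 0.
n_states = 5

def phi(s):
    """One-hot features; any linear parameterization V(s) = theta @ phi(s)
    fits the setting the survey considers."""
    v = np.zeros(n_states)
    v[s] = 1.0
    return v

gamma = 1.0   # this episodic example is undiscounted; the survey treats
              # discounted MDPs, and gamma < 1 uses the identical update
alpha = 0.1   # step size (illustrative choice)
theta = np.zeros(n_states)

for episode in range(2000):
    s = n_states // 2                     # start in the middle state
    done = False
    while not done:
        s_next = s + rng.choice([-1, 1])  # sample a transition
        if s_next < 0:
            r, done = 0.0, True
        elif s_next >= n_states:
            r, done = 1.0, True
        else:
            r, done = 0.0, False
        # Semi-gradient TD(0): move theta along the TD error times phi(s)
        target = r + (0.0 if done else gamma * theta @ phi(s_next))
        delta = target - theta @ phi(s)
        theta += alpha * delta * phi(s)
        s = s_next

print(np.round(theta, 2))  # fluctuates around the true values [1/6 .. 5/6]
```

With one-hot features this reduces to tabular TD(0); swapping in coarser features (fewer than n_states dimensions) makes the estimate a genuine linear approximation, which is where the surrogate cost functions and stability questions discussed in the article become relevant.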
