Abstract

The estimation of the value function is critical for model-free RL algorithms. Unlike Monte Carlo (MC) methods, temporal difference (TD) methods learn the value function by reusing existing value estimates. The underlying mechanism in TD is bootstrapping. The word “bootstrapping” originated in the early 19th century with the expression “pulling oneself up by one’s own bootstraps”. Initially, this expression implied an obviously impossible feat. Later, it became a metaphor for achieving success with self-assistance. In statistical learning, bootstrapping can be interpreted as a sample reuse technique that feeds historical estimates of a quantity back into the update step for that same quantity. In temporal difference learning, bootstrapping is the mechanism by which historical value estimates are reused to update the current value function. Like MC, TD estimates the value function from experience alone, without requiring prior knowledge of the environment dynamics. The advantage of TD lies in the fact that it can update the value function based on its current estimate. Therefore, TD learning algorithms can learn from incomplete episodes or continuing tasks in a step-by-step manner, while MC must be implemented in an episode-by-episode fashion.
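To make the bootstrapping contrast concrete, the following minimal sketch (not from the paper; the function names, toy episode, and parameter values are illustrative assumptions) compares a tabular one-step TD(0) update, which reuses the current estimate of the next state's value, with an every-visit Monte Carlo update, which must wait for a completed episode before it can compute the return.

```python
# A minimal sketch contrasting TD(0) bootstrapping with Monte Carlo updates.
# All names, the toy episode, and the step size / discount values are
# illustrative assumptions, not code from the paper.

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """One-step TD(0): bootstrap from the existing estimate V[s_next]."""
    td_target = r + gamma * V[s_next]      # reuses the current value estimate
    V[s] += alpha * (td_target - V[s])     # move V[s] toward the TD target

def mc_update(V, episode, alpha=0.1, gamma=0.99):
    """Every-visit Monte Carlo: needs the complete episode before updating."""
    G = 0.0
    for s, r in reversed(episode):         # accumulate the actual sampled return
        G = r + gamma * G
        V[s] += alpha * (G - V[s])         # move V[s] toward the observed return

# Toy episode: (state, reward received on the transition out of that state),
# ending in terminal state 3 whose value stays fixed at 0.
episode = [(0, 0.0), (1, 0.0), (2, 1.0)]
terminal = 3

# TD(0) can update after every single step, even mid-episode.
V_td = {s: 0.0 for s in range(4)}
for (s, r), (s_next, _) in zip(episode, episode[1:] + [(terminal, 0.0)]):
    td0_update(V_td, s, r, s_next)

# MC must wait until the episode terminates before updating anything.
V_mc = {s: 0.0 for s in range(4)}
mc_update(V_mc, episode)

print("TD(0):", V_td)
print("MC:   ", V_mc)
```

Running the sketch shows the difference in update timing: the TD table changes after each transition using the bootstrapped target r + γV(s'), whereas the MC table is only touched once the full return of the finished episode is known.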
