Abstract

The estimation of the value function is critical for model-free RL algorithms. Unlike Monte Carlo (MC) methods, temporal difference (TD) methods learn the value function by reusing existing value estimates. The underlying mechanism in TD is bootstrapping. The word “bootstrapping” originated in the early 19th century with the expression “pulling oneself up by one’s own bootstraps”. Initially, this expression implied an obviously impossible feat. Later, it became a metaphor for achieving success with self-assistance. In statistical learning, bootstrapping can be interpreted as a sample reuse technique that feeds historical estimates of a quantity back into the update step for that same quantity. In temporal difference learning, bootstrapping is the mechanism by which historical value estimates are reused to update the current value function. Like MC, TD estimates the value function from experience alone, without requiring prior knowledge of the environment dynamics. The advantage of TD lies in the fact that it can update the value function based on its current estimate. Therefore, TD learning algorithms can learn from incomplete episodes or continuing tasks in a step-by-step manner, while MC must be implemented in an episode-by-episode fashion.
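To make the bootstrapping contrast concrete, the following minimal sketch (not from the paper; the function names, toy episode, and parameter values are illustrative assumptions) compares a tabular one-step TD(0) update, which reuses the current estimate of the next state's value, with an every-visit Monte Carlo update, which must wait for a completed episode before it can compute the return.

```python
# A minimal sketch contrasting TD(0) bootstrapping with Monte Carlo updates.
# All names, the toy episode, and the step size / discount values are
# illustrative assumptions, not code from the paper.

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """One-step TD(0): bootstrap from the existing estimate V[s_next]."""
    td_target = r + gamma * V[s_next]      # reuses the current value estimate
    V[s] += alpha * (td_target - V[s])     # move V[s] toward the TD target

def mc_update(V, episode, alpha=0.1, gamma=0.99):
    """Every-visit Monte Carlo: needs the complete episode before updating."""
    G = 0.0
    for s, r in reversed(episode):         # accumulate the actual sampled return
        G = r + gamma * G
        V[s] += alpha * (G - V[s])         # move V[s] toward the observed return

# Toy episode: (state, reward received on the transition out of that state),
# ending in terminal state 3 whose value stays fixed at 0.
episode = [(0, 0.0), (1, 0.0), (2, 1.0)]
terminal = 3

# TD(0) can update after every single step, even mid-episode.
V_td = {s: 0.0 for s in range(4)}
for (s, r), (s_next, _) in zip(episode, episode[1:] + [(terminal, 0.0)]):
    td0_update(V_td, s, r, s_next)

# MC must wait until the episode terminates before updating anything.
V_mc = {s: 0.0 for s in range(4)}
mc_update(V_mc, episode)

print("TD(0):", V_td)
print("MC:   ", V_mc)
```

Running the sketch shows the difference in update timing: the TD table changes after each transition using the bootstrapped target r + γV(s'), whereas the MC table is only touched once the full return of the finished episode is known.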
