Abstract

A common approach to learning from delayed rewards is to use temporal difference (TD) methods for predicting future reinforcement values. These methods are parameterized by a recency factor λ, which determines whether and how the outcomes of several consecutive time steps contribute to a single prediction update. TD(λ > 0) has been found to usually yield noticeably faster learning than TD(0), but its standard eligibility-trace implementation suffers from some well-known deficiencies, in particular a significantly increased computational expense. This article theoretically investigates two possible ways of implementing TD(λ) without eligibility traces, both proposed by prior work. One is the TTD procedure, which efficiently approximates the effects of eligibility traces by using truncated TD(λ) returns. The other is experience replay, which relies on replaying TD prediction updates backwards in time. We provide novel theoretical results related to the former and present an original analysis of the effects of two variations of the latter. The ultimate effect of these investigations is a unified view of these apparently different computational techniques. This contributes to TD(λ) research in general by highlighting interesting relationships between several TD-based algorithms and facilitating their further analysis.
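
To make the two mechanisms contrasted above more concrete, the sketch below compares a tabular TD(λ) predictor that uses accumulating eligibility traces with one that updates each state towards an m-step truncated TD(λ) return, the quantity the TTD procedure is built around. This is a minimal illustration only, not the paper's algorithm: the random-walk environment, the truncation depth m, and all parameter values (γ, λ, α, episode count) are assumptions chosen for the example.

import numpy as np

# Minimal illustrative sketch (not from the paper): a small random walk
# used to contrast two realisations of TD(lambda) prediction.
N_STATES = 5                      # non-terminal states 0..4; index 5 is terminal
GAMMA, LAMBDA, ALPHA = 0.9, 0.8, 0.1
rng = np.random.default_rng(0)

def run_episode(start=2):
    """Random walk, reflecting at 0, absorbing on the right with reward 1."""
    states, rewards = [start], []
    s = start
    while True:
        s = max(0, s + rng.choice([-1, 1]))
        if s >= N_STATES:
            states.append(N_STATES)
            rewards.append(1.0)
            return states, rewards
        states.append(s)
        rewards.append(0.0)

def td_lambda_traces(episodes=500):
    """Backward view: TD(lambda) with accumulating eligibility traces,
    updates applied in trajectory order."""
    V = np.zeros(N_STATES + 1)            # V[terminal] stays 0
    for _ in range(episodes):
        states, rewards = run_episode()
        e = np.zeros_like(V)              # eligibility traces, reset each episode
        for t in range(len(rewards)):
            s, s_next, r = states[t], states[t + 1], rewards[t]
            delta = r + GAMMA * V[s_next] - V[s]
            e[s] += 1.0                   # mark the visited state as eligible
            V += ALPHA * delta * e        # all recently visited states share the update
            e *= GAMMA * LAMBDA           # traces decay over time
    return V[:N_STATES]

def n_step_return(rewards, V, states, t, n):
    """n-step corrected return: r_{t+1} + ... + gamma^{n-1} r_{t+n} + gamma^n V(s_{t+n})."""
    n = min(n, len(rewards) - t)          # clip at episode end
    G = sum(GAMMA ** k * rewards[t + k] for k in range(n))
    return G + GAMMA ** n * V[states[t + n]]

def truncated_lambda_return(rewards, V, states, t, m):
    """m-step truncated TD(lambda) return: a lambda-weighted mix of the first m n-step returns."""
    G = (1 - LAMBDA) * sum(LAMBDA ** (n - 1) * n_step_return(rewards, V, states, t, n)
                           for n in range(1, m))
    return G + LAMBDA ** (m - 1) * n_step_return(rewards, V, states, t, m)

def td_lambda_truncated(episodes=500, m=8):
    """Forward view with truncation: move each visited state towards its truncated return."""
    V = np.zeros(N_STATES + 1)
    for _ in range(episodes):
        states, rewards = run_episode()
        for t in range(len(rewards)):
            target = truncated_lambda_return(rewards, V, states, t, m)
            V[states[t]] += ALPHA * (target - V[states[t]])
    return V[:N_STATES]

print("eligibility traces:", np.round(td_lambda_traces(), 3))
print("truncated returns: ", np.round(td_lambda_truncated(), 3))

With a truncation depth m that covers most of an episode, the two variants should produce very similar value estimates, which reflects the approximation relationship between eligibility traces and truncated TD(λ) returns described in the abstract.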
