Abstract
A common approach to learning from delayed rewards is to use temporal difference (TD) methods for predicting future reinforcement values. They are parameterized by a recency factor λ, which determines whether and how the outcomes of several consecutive time steps contribute to a single prediction update. TD(λ > 0) has been found to yield noticeably faster learning than TD(0) in most cases, but its standard eligibility-trace implementation suffers from some well-known deficiencies, in particular a significant increase in computational expense. This article theoretically investigates two possible ways of implementing TD(λ) without eligibility traces, both proposed by prior work. One is the TTD procedure, which efficiently approximates the effects of eligibility traces by using truncated TD(λ) returns. The other is experience replay, which relies on replaying TD prediction updates backwards in time. We provide novel theoretical results related to the former and present an original analysis of the effects of two variations of the latter. The ultimate outcome of these investigations is a unified view of these apparently different computational techniques. This contributes to TD(λ) research in general by highlighting interesting relationships between several TD-based algorithms and facilitating their further analysis.
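For reference, a brief sketch of the standard formulations mentioned above; the notation ($V$ for value estimates, $\alpha$ for the step size, $\gamma$ for the discount factor, $\delta_t$ for the TD error, $e_t$ for eligibility traces, and $m$ for the truncation depth) is conventional TD notation and may differ in detail from the paper's own. The accumulating-trace TD(λ) prediction update is commonly written as
\begin{align*}
  \delta_t &= r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t), \\
  e_t(s)   &= \gamma \lambda\, e_{t-1}(s) + \mathbb{1}[s = s_t], \\
  V_{t+1}(s) &= V_t(s) + \alpha\, \delta_t\, e_t(s) \quad \text{for all } s,
\end{align*}
while the $m$-step truncated TD(λ) return of the kind used by the TTD procedure can be defined recursively (indexing conventions vary) as
\begin{align*}
  z_t^{\lambda,m} = r_{t+1} + \gamma\big[(1-\lambda)\, V_t(s_{t+1}) + \lambda\, z_{t+1}^{\lambda,m-1}\big],
  \qquad z_t^{\lambda,0} = V_t(s_t),
\end{align*}
so that $m = 1$ recovers the one-step TD(0) target $r_{t+1} + \gamma V_t(s_{t+1})$.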