Abstract

Since the late 1980s, temporal difference (TD) learning has dominated the research area of policy evaluation algorithms. However, the demand for the avoidance of TD defects, such as low data-efficiency and divergence in off-policy learning, has inspired the studies of a large number of novel TD-based approaches. Gradient-based and least-squares-based algorithms comprise the major part of these new approaches. This paper aims to combine advantages of these two categories to derive an efficient policy evaluation algorithm with O ( n 2 ) per-time-step runtime complexity. The least-squares-based framework is adopted, and the gradient correction is used to improve convergence performance. This paper begins with the revision of a previous O ( n 3 ) batch algorithm, least-squares TD with a gradient correction (LS-TDC) to regularize the parameter vector. Based on the recursive least-squares technique, an O ( n 2 ) counterpart of LS-TDC called RC is proposed. To increase data efficiency, we generalize RC with eligibility traces. An off-policy extension is also proposed based on importance sampling. In addition, the convergence analysis for RC as well as LS-TDC is given. The empirical results in both on-policy and off-policy benchmarks show that RC has a higher estimation accuracy than that of RLSTD and a significantly lower runtime complexity than that of LSTDC.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call