Abstract

The problem of off-policy evaluation (OPE) has long been recognized as one of the foremost challenges in reinforcement learning. Gradient-based and emphasis-based temporal-difference (TD) learning algorithms constitute the majority of off-policy TD learning methods. In this work, we derive efficient OPE algorithms from a novel perspective that combines the advantages of these two categories: we adopt the gradient-based framework and use the emphatic approach to improve convergence. We begin by proposing a new analogue of the on-policy objective, called the distribution-correction-based mean square projected Bellman error (DC-MSPBE). The key to the construction of DC-MSPBE is the use of emphatic weightings on the representable subspace of the original MSPBE. Based on this objective function, we propose the emphatic TD with lower-variance gradient correction (ETD-LVC) algorithm. Under standard off-policy and stochastic-approximation conditions, we provide a convergence analysis of ETD-LVC in the case of linear function approximation, and we further generalize the algorithm to smooth nonlinear function approximation. Finally, we empirically demonstrate the improved performance of ETD-LVC on off-policy benchmarks. Taken together, we hope that our work can guide the future discovery of better alternatives in the off-policy TD learning algorithm family.
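The abstract does not spell out the ETD-LVC update itself, but the ingredients it names are standard. Below is a minimal, hedged sketch of how an emphatic follow-on trace (from ETD) can be combined with a TDC-style gradient correction under linear function approximation; the function name `etd_lvc_step` and the exact placement of the emphasis weighting `M` are illustrative assumptions, not the paper's definitive update.

```python
import numpy as np

def etd_lvc_step(theta, w, F, phi, phi_next, r, rho, gamma, interest, alpha, beta):
    """One illustrative emphatic TDC-style update with linear features.

    Assumptions (not from the paper): lambda = 0, so the emphasis M equals
    the follow-on trace F; the importance-sampling ratio rho is applied to
    both the primary and secondary updates.
    """
    # Follow-on trace: accumulates (discounted, reweighted) interest.
    # For simplicity we reuse the current rho; ETD proper uses the previous step's.
    F = rho * gamma * F + interest
    M = F  # emphasis weighting (lambda = 0 case)

    delta = r + gamma * theta @ phi_next - theta @ phi  # TD error
    # TDC-style corrected update for the value weights, scaled by emphasis M
    theta = theta + alpha * rho * M * (delta * phi - gamma * (w @ phi) * phi_next)
    # Secondary weights track the expected TD error (the gradient correction)
    w = w + beta * rho * M * (delta - w @ phi) * phi
    return theta, w, F
```

A single step from zero weights shows the mechanics: with `delta = r` initially, the primary update reduces to an emphasis-weighted semi-gradient step, and the correction term only activates once `w` has learned a nonzero estimate of the TD error.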
