Abstract
Estimating value functions is a core component of reinforcement learning algorithms. Temporal difference (TD) learning algorithms use bootstrapping, i.e. they update the value function toward a learning target using value estimates at subsequent time-steps. Alternatively, the value function can be updated toward a learning target constructed by separately predicting successor features (SF)—a policy-dependent model—and linearly combining them with instantaneous rewards. We focus on bootstrapping targets used when estimating value functions, and propose a new backup target, the η-return mixture, which implicitly combines value-predictive knowledge (used by TD methods) with (successor) feature-predictive knowledge—with a parameter η capturing how much to rely on each. We illustrate that incorporating predictive knowledge through an ηγ-discounted SF model makes more efficient use of sampled experience, compared to either extreme, i.e. bootstrapping entirely on the value function estimate, or bootstrapping on the product of separately estimated successor features and instantaneous reward models. We empirically show this approach leads to faster policy evaluation and better control performance, for tabular and nonlinear function approximations, indicating scalability and generality.
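To make the idea of mixing the two kinds of bootstrapping concrete, below is a minimal tabular sketch in Python. It is an illustrative assumption, not the authors' code: it convexly mixes the two extremes named above—bootstrapping on the value estimate V(s') versus bootstrapping on a successor-feature estimate ψ(s')ᵀw—using a parameter η and a fully γ-discounted SF model, whereas the paper's η-return mixture is constructed from an ηγ-discounted SF model, so the exact target differs. All names (phi, psi, w, eta) and the toy random-walk environment are hypothetical.

```python
# Illustrative sketch only: a simplified mixture of value-predictive and
# feature-predictive bootstrap targets, controlled by a parameter eta.
# eta = 0 recovers the one-step TD target r + gamma * V(s');
# eta = 1 bootstraps entirely on the SF-based estimate psi(s') @ w.
# This is NOT the paper's exact eta-return mixture (which uses an
# eta*gamma-discounted SF model); it only illustrates the general idea.
import numpy as np

rng = np.random.default_rng(0)

n_states = 5
gamma, eta, alpha = 0.99, 0.5, 0.1

phi = np.eye(n_states)                 # one-hot state features (tabular case)
V = np.zeros(n_states)                 # value estimates
psi = np.zeros((n_states, n_states))   # gamma-discounted successor features
w = np.zeros(n_states)                 # reward model: r(s) ~= phi(s) @ w


def mixed_target(r, s_next):
    """Bootstrap target mixing value- and feature-predictive knowledge."""
    sf_estimate = psi[s_next] @ w      # value predicted from SFs + reward model
    return r + gamma * ((1.0 - eta) * V[s_next] + eta * sf_estimate)


def update(s, r, s_next):
    # Reward-model regression toward the observed instantaneous reward.
    w[:] += alpha * (r - phi[s] @ w) * phi[s]

    # TD-style successor-feature update (gamma-discounted here).
    psi[s] += alpha * (phi[s] + gamma * psi[s_next] - psi[s])

    # Value update toward the mixed bootstrap target.
    V[s] += alpha * (mixed_target(r, s_next) - V[s])


# Tiny random-walk simulation, just to exercise the updates.
s = 0
for _ in range(1000):
    s_next = int(rng.integers(n_states))
    r = float(s_next == n_states - 1)  # reward for reaching the last state
    update(s, r, s_next)
    s = s_next

print("V:", np.round(V, 3))
```

Setting eta between 0 and 1 lets the update lean on the learned SF and reward models where they are accurate while still bootstrapping on the value estimate, which is the trade-off the abstract describes.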