Abstract

This work studies a new reinforcement learning method within the framework of Recursive Least-Squares Temporal Difference (RLS-TD) learning. In contrast to the standard mechanism of eligibility traces, which leads to RLS-TD(λ), we show that the forgetting factor commonly used in gradient-based estimation plays a role similar to that of eligibility traces. We adopt an instrumental variable perspective to illustrate this point and propose a new algorithm, RLS-TD with forgetting factor (RLS-TD-f). We test the proposed algorithm in a Policy Iteration setting, i.e., when the performance of an initially stabilizing controller must be improved. We take the cart-pole benchmark as the experimental platform: extensive experiments show that the proposed RLS-TD algorithm achieves larger performance improvements over most of the state space.

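To make the role of the forgetting factor concrete, the following is a minimal sketch of a recursive least-squares TD(0) update with an exponential forgetting factor under linear value-function approximation. The function name, variable names, and the specific recursion shown are illustrative assumptions for a generic RLS-TD-style update and are not taken from the paper.

import numpy as np

def rls_td_forgetting_update(theta, P, phi, phi_next, reward, gamma=0.99, beta=0.98):
    # One recursive least-squares TD(0) update with exponential forgetting.
    #   theta    : weights of the linear value function V(s) ~ phi(s)^T theta
    #   P        : running inverse of the (forgetting-weighted) correlation matrix
    #   phi      : feature vector of the current state (instrumental variable)
    #   phi_next : feature vector of the next state
    #   beta     : forgetting factor in (0, 1]; beta = 1 recovers plain RLS-TD(0)
    delta_phi = phi - gamma * phi_next          # TD regressor (phi - gamma * phi')
    td_error = reward - delta_phi @ theta       # temporal-difference error

    # Gain vector via the Sherman-Morrison recursion with forgetting
    denom = beta + delta_phi @ (P @ phi)
    gain = (P @ phi) / denom

    # Parameter and inverse-correlation-matrix updates
    theta = theta + gain * td_error
    P = (P - np.outer(gain, delta_phi @ P)) / beta
    return theta, P

With beta below 1, older transitions are exponentially discounted, so the estimate tracks changes in the target value function; this is the qualitative behaviour the abstract relates to eligibility traces.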