Abstract

Q-learning (QL) is a popular method for control problems. It approximates the maximum expected action value by the maximum estimated action value, and therefore suffers from positive overestimation bias. Various algorithms have been proposed to reduce this overestimation bias, but some of them introduce underestimation bias instead. Moreover, which kinds of estimators cause underestimation remains less well understood. In this paper, rather than studying one specific method, we focus on underestimation estimators, in particular those built from K estimates of the action values. We generalize these estimators into an Underestimation Estimator Set (UES) and theoretically prove that every estimator in this set suffers from underestimation bias. We further study the bias properties of these estimators and show that their biases differ from one another and depend on the specific conditions each estimator satisfies. Our set thus provides a variety of estimators for QL in different settings. Finally, to better illustrate the properties of these estimators, we evaluate the performance of several estimators from our set. Empirical results show that the Median estimator (Me) underestimates less than double Q-learning (DQL) and does not overestimate as QL does, while the Min estimator (M1E) underestimates more than DQL. Moreover, Me and M1E perform as well as or better than other estimators on several benchmark environments.
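
The bias ordering described above can be illustrated with a small simulation. The sketch below is our own toy construction, not the paper's exact estimators or experiments: it assumes K independent noisy estimates of each action value, selects the greedy action with one estimate, and then evaluates that action with the same estimate (QL-style), with an independent estimate (DQL-style), with the median of all K estimates (in the spirit of Me), or with their minimum (in the spirit of M1E). The toy value distribution, noise level, and variable names are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting (illustrative assumption, not the paper's experiment):
# true action values differ, and each of K estimators observes every
# action value with independent zero-mean Gaussian noise.
n_actions, K, n_trials = 10, 4, 50_000
true_q = np.linspace(-1.0, 1.0, n_actions)  # true max expected value = 1.0
true_max = true_q.max()

ql = dql = me = m1e = 0.0
for _ in range(n_trials):
    # K independent, unbiased estimates of each action value.
    est = true_q + rng.normal(0.0, 1.0, size=(K, n_actions))
    a_star = int(np.argmax(est[0]))  # greedy action under estimator 0

    ql  += est[0, a_star]            # QL: select and evaluate with the same estimate
    dql += est[1, a_star]            # DQL: evaluate with an independent estimate
    me  += np.median(est[:, a_star]) # Me-style: median of the K estimates at a*
    m1e += est[:, a_star].min()      # M1E-style: minimum of the K estimates at a*

for name, total in [("QL", ql), ("DQL", dql), ("Me", me), ("M1E", m1e)]:
    print(f"{name:>4} bias: {total / n_trials - true_max:+.3f}")
```

Under these assumptions, the QL-style target shows positive bias, the min-based target shows the strongest negative bias, and the median-based target falls between QL and DQL, consistent with the qualitative ordering reported in the abstract.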
