Abstract

Greedy-step Q-learning (GQL) can effectively accelerate the Q-value update process. However, because it is an improved version of Q-learning, it also suffers from Q-value overestimation. Since GQL uses two max operators to iteratively compute Q-values, many existing methods for reducing Q-value estimation bias are not applicable to it. To address this issue, this study proposes an alternated greedy-step update (AGU) framework consisting of two independent Q-value estimators. In the AGU framework, one estimator determines the time step that maximizes the estimated n-step return, and the other estimator is used to update the former through a target value computed on the basis of the determined time step. The convergence of the AGU framework is proved theoretically. In addition, an alternated greedy-step deterministic policy gradient (AGDPG) algorithm applicable to continuous-action tasks is proposed by combining the AGU framework with deep deterministic policy gradient (DDPG). Experiments with AGDPG on continuous-action MuJoCo tasks highlight its superior performance.
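Based only on the description above, the following is a minimal tabular sketch of how the two estimators might interact when forming an update target. The function name agu_target, the trajectory-segment inputs, and the choice of which estimator supplies the final bootstrap value are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def agu_target(rewards, next_states, q_select, q_eval, gamma=0.99):
    """Illustrative AGU-style target for a trajectory segment (assumption).

    rewards     : rewards r_t, ..., r_{t+N-1} along the segment
    next_states : states  s_{t+1}, ..., s_{t+N}
    q_select    : Q-table used only to pick the greedy step n*
    q_eval      : Q-table used to compute the target at the chosen step
    """
    best_n, best_return = 1, -np.inf
    ret = 0.0
    for n, (r, s_next) in enumerate(zip(rewards, next_states), start=1):
        ret += gamma ** (n - 1) * r
        # n-step return estimated with the *selection* estimator
        candidate = ret + gamma ** n * np.max(q_select[s_next])
        if candidate > best_return:
            best_return, best_n = candidate, n
    # Target built on the chosen step n*, bootstrapped with the other
    # estimator, so that step selection and value evaluation are decoupled.
    partial = sum(gamma ** i * rewards[i] for i in range(best_n))
    return partial + gamma ** best_n * np.max(q_eval[next_states[best_n - 1]])
```

Presumably, the roles of the two estimators alternate between updates, in a spirit similar to the decoupling used in double Q-learning, and AGDPG would replace the tables with critic networks within the DDPG architecture; these details are not specified in the abstract.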
