Abstract

Greedy-step Q-learning (GQL) can effectively accelerate the Q-value update process. However, because it is an improved version of Q-learning, it also suffers from Q-value overestimation. Since GQL uses two max operators to iteratively compute the Q-value, many existing solutions for reducing Q-value estimation bias are not applicable to it. To address this issue, this study proposes an alternated greedy-step update (AGU) framework consisting of two independent Q-value estimators: one estimator determines the time step that maximizes the estimated n-step return, and the other provides the target value, computed at the determined time step, that is used to update the first estimator. The convergence of the AGU framework is proved theoretically. In addition, an alternated greedy-step deterministic policy gradient (AGDPG) algorithm for continuous-action tasks is proposed by combining the AGU framework with deep deterministic policy gradient (DDPG). Experiments with AGDPG on continuous-action MuJoCo tasks demonstrate its superior performance.
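The following is a minimal, hypothetical sketch of the decoupling idea described above in a tabular setting: one estimator (here called q_a) selects the greedy bootstrap step that maximizes the n-step return, while the other estimator (q_b) evaluates the target at that step. The function name, the tabular representation, and the exact form of the return are illustrative assumptions and do not reproduce the paper's implementation.

```python
import numpy as np

def alternated_greedy_step_target(rewards, next_states, q_a, q_b, gamma=0.99):
    """Sketch of an AGU-style target for a length-N segment of experience.

    rewards[i]     : reward received at step i of the segment
    next_states[i] : index of the state reached after step i
    q_a, q_b       : two independent Q tables of shape (num_states, num_actions)
    """
    horizon = len(rewards)
    returns_a = np.empty(horizon)
    partial = 0.0
    for n in range(horizon):
        partial += (gamma ** n) * rewards[n]
        # n-step return bootstrapped with estimator A (used only to select the step).
        returns_a[n] = partial + (gamma ** (n + 1)) * np.max(q_a[next_states[n]])
    n_star = int(np.argmax(returns_a))  # greedy step chosen by estimator A

    # Evaluate the target with the other estimator B at the selected step,
    # decoupling step selection from value evaluation to limit overestimation.
    discounted_rewards = sum((gamma ** i) * rewards[i] for i in range(n_star + 1))
    return discounted_rewards + (gamma ** (n_star + 1)) * np.max(q_b[next_states[n_star]])
```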
