Abstract

In value-based and actor-critic reinforcement learning (RL) methods, both inaccuracy and instability of value estimation detrimentally affect performance. Typical RL methods such as Maxmin Q-learning and QMD3 are plagued by the underestimation problem and fail to trade off estimation bias and variance jointly. To address these shortcomings, we propose the Reinforced Operation (RO), which selects the estimate closest to the median among multiple Q-functions. RO is applicable to any model-free RL method. Theoretically, we introduce the Mean Square Error (MSE) to jointly analyze the estimation bias and variance of value estimation methods. We also demonstrate the superiority of RO in reducing MSE and derive an upper bound on the estimation bias of value estimation methods under arbitrary distributions, which guides the calculation of their estimation bias. Based on RO, we propose Reinforced Q-learning (RQ) and Reinforced Delayed Deep Deterministic policy gradient (RD3), variants of Q-learning and TD3, respectively, to tackle different tasks. We empirically demonstrate that our method reduces estimation error and achieves superior performance on discrete and continuous benchmark tasks.
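
The sketch below is a minimal NumPy illustration of the selection step described above, assuming RO picks, for each target, the ensemble estimate nearest the median of the Q-functions; the function name, array shapes, and example values are hypothetical and not taken from the paper.

import numpy as np

def reinforced_operation(q_values: np.ndarray) -> np.ndarray:
    # Hypothetical sketch of the Reinforced Operation (RO), assuming it
    # returns, per sample, the Q-estimate closest to the ensemble median.
    # q_values: shape (num_q_functions, batch_size).
    median = np.median(q_values, axis=0)                 # per-sample median of the ensemble
    idx = np.argmin(np.abs(q_values - median), axis=0)   # index of the estimate nearest the median
    return q_values[idx, np.arange(q_values.shape[1])]   # selected targets, shape (batch_size,)

# Usage: three Q-networks' estimates for a batch of two transitions
q = np.array([[1.0, 4.0],
              [2.5, 3.0],
              [9.0, 3.5]])
print(reinforced_operation(q))  # -> [2.5 3.5]

Compared with taking the minimum over the ensemble (as in clipped double Q-learning), selecting the estimate nearest the median is one way to avoid systematic underestimation while still suppressing outlier estimates, which is consistent with the bias-variance trade-off motivation stated in the abstract.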
