Abstract

A variety of algorithms has been proposed to mitigate the overestimation bias of Q-learning. These algorithms reduce the estimate of the maximum Q-value, i.e., they perform a homogeneous update. As a result, some of them, such as Double Q-learning, suffer from underestimation bias. In contrast, this paper proposes a heterogeneous update, which aims to enlarge the normalized gap between the Q-value of the optimal action and the Q-values of the other actions. Based on heterogeneous update, we design HetUp Q-learning. More specifically, HetUp Q-learning increases the normalized gap by overestimating the Q-value of the optimal action and underestimating the Q-values of the other actions. However, one limitation is that HetUp Q-learning requires the optimal action as input to decide whether a state-action pair should be overestimated or underestimated. To address this challenge, we apply a softmax strategy to estimate the optimal action, yielding HetUpSoft Q-learning. We also extend HetUpSoft Q-learning to HetUpSoft DQN for high-dimensional environments. Extensive experimental results show that our proposed methods outperform state-of-the-art baselines by a large margin in different settings. In particular, HetUpSoft DQN improves the average score per episode over state-of-the-art baselines by at least 55.49% and 32.26% in the Pixelcopter and Breakout environments, respectively.
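The abstract only sketches the mechanism, so the snippet below is a minimal tabular illustration of the heterogeneous-update idea under stated assumptions, not the authors' algorithm: the function name hetupsoft_update, the bias magnitude beta, the softmax temperature tau, and the way the softmax probability is folded into the target are hypothetical placeholders introduced for illustration.

import numpy as np

def hetupsoft_update(Q, s, a, r, s_next, done,
                     alpha=0.1, gamma=0.99, beta=0.05, tau=1.0):
    # Standard Q-learning bootstrap target.
    target = r if done else r + gamma * Q[s_next].max()

    # Softmax over the current state's Q-values estimates how likely each
    # action is to be the optimal one (the "softmax strategy" mentioned in
    # the abstract; the exact construction is assumed here).
    logits = (Q[s] - Q[s].max()) / tau
    probs = np.exp(logits) / np.exp(logits).sum()

    # Heterogeneous adjustment (hypothetical form): push the target up when
    # the taken action is likely optimal, down otherwise, which widens the
    # gap between the optimal action's Q-value and the others.
    target += beta * (2.0 * probs[a] - 1.0)

    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# Example usage on a toy table with 5 states and 2 actions.
Q = np.zeros((5, 2))
Q = hetupsoft_update(Q, s=0, a=1, r=1.0, s_next=2, done=False)

A deep variant in the spirit of HetUpSoft DQN would apply the same kind of adjustment to the bootstrapped target of a Q-network rather than to a table; the abstract does not specify those details, so they are omitted here.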
