Abstract

A variety of algorithms has been proposed to mitigate the overestimation bias of Q-learning. These algorithms reduce the estimate of the maximum Q-value, i.e., they perform a homogeneous update. As a result, some of them, such as Double Q-learning, suffer from underestimation bias. In contrast, this paper proposes a heterogeneous update, which aims to enlarge the normalized gap between the Q-value of the optimal action and the Q-values of the other actions. Based on heterogeneous update, we design HetUp Q-learning. More specifically, HetUp Q-learning increases the normalized gap by overestimating the Q-value of the optimal action and underestimating the Q-values of the other actions. However, one limitation is that HetUp Q-learning requires the optimal action as input to decide whether a state-action pair should be overestimated or underestimated. To address this challenge, we apply a softmax strategy to estimate the optimal action, yielding HetUpSoft Q-learning. We also extend HetUpSoft Q-learning to HetUpSoft DQN for high-dimensional environments. Extensive experimental results show that our proposed methods outperform state-of-the-art baselines by a large margin in different settings. In particular, HetUpSoft DQN improves the average score per episode over state-of-the-art baselines by at least 55.49% and 32.26% in the Pixelcopter and Breakout environments, respectively.
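The abstract only sketches the mechanism, so the snippet below is a minimal tabular illustration of the heterogeneous-update idea under stated assumptions, not the authors' algorithm: the function name hetupsoft_update, the bias magnitude beta, the softmax temperature tau, and the way the softmax probability is folded into the target are hypothetical placeholders introduced for illustration.

import numpy as np

def hetupsoft_update(Q, s, a, r, s_next, done,
                     alpha=0.1, gamma=0.99, beta=0.05, tau=1.0):
    # Standard Q-learning bootstrap target.
    target = r if done else r + gamma * Q[s_next].max()

    # Softmax over the current state's Q-values estimates how likely each
    # action is to be the optimal one (the "softmax strategy" mentioned in
    # the abstract; the exact construction is assumed here).
    logits = (Q[s] - Q[s].max()) / tau
    probs = np.exp(logits) / np.exp(logits).sum()

    # Heterogeneous adjustment (hypothetical form): push the target up when
    # the taken action is likely optimal, down otherwise, which widens the
    # gap between the optimal action's Q-value and the others.
    target += beta * (2.0 * probs[a] - 1.0)

    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# Example usage on a toy table with 5 states and 2 actions.
Q = np.zeros((5, 2))
Q = hetupsoft_update(Q, s=0, a=1, r=1.0, s_next=2, done=False)

A deep variant in the spirit of HetUpSoft DQN would apply the same kind of adjustment to the bootstrapped target of a Q-network rather than to a table; the abstract does not specify those details, so they are omitted here.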
