Reinforcement Learning (RL) has found widespread application in a variety of decision-making tasks, but it still faces challenges such as the deadly triad, slow convergence, and rewards drop, which limit its practical scope. This paper addresses the slow convergence and rewards drop observed during RL training. Our proposed solution is a reward shaping method composed of two components, each serving a specific purpose: accelerating training and enhancing stability. The two components are coupled through hyper-parameters, and the choice of these hyper-parameters plays a critical role in the final performance of the RL algorithm. To optimize them effectively, we employ discrete sampling to cover the value ranges of the parameters, which yields a sparse set of data points within the reward matrix. We then introduce a fitting approach based on the Expectation Maximization (EM) algorithm to estimate the global maximum of the reward matrix together with the corresponding hyper-parameter combination, significantly reducing computational complexity. Extensive experiments across a range of RL environments demonstrate the effectiveness of the proposed method: it mitigates rewards drop while accelerating the convergence of the RL algorithm.
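The abstract does not give implementation details, so the following is a minimal, hypothetical sketch (NumPy only) of what the described pipeline could look like: sample two reward-shaping hyper-parameters on a coarse grid, record the final return of each run as a sparse reward matrix, and use a reward-weighted, EM-style fit of a single 2-D Gaussian surrogate whose mean estimates the peak. The function names, the two-parameter setup, and the single-Gaussian surrogate are illustrative assumptions, not the paper's actual algorithm.

```python
import numpy as np

def sparse_reward_matrix(train_fn, alphas, betas):
    """Run the RL training routine at each (alpha, beta) grid point (assumed interface)."""
    R = np.zeros((len(alphas), len(betas)))
    for i, a in enumerate(alphas):
        for j, b in enumerate(betas):
            R[i, j] = train_fn(a, b)  # final averaged return of one training run
    return R

def em_peak_estimate(R, alphas, betas, iters=50):
    """Estimate the hyper-parameter combination near the reward peak.

    Treats shifted rewards as weights for a single Gaussian component and
    alternates responsibility/mean/covariance updates -- an EM-flavoured
    surrogate fit, assumed here purely for illustration.
    """
    A, B = np.meshgrid(alphas, betas, indexing="ij")
    pts = np.stack([A.ravel(), B.ravel()], axis=1)   # grid coordinates
    w = (R - R.min()).ravel() + 1e-8                 # non-negative reward weights
    mu = pts.mean(axis=0)
    cov = np.cov(pts.T) + 1e-6 * np.eye(2)
    for _ in range(iters):
        # E-step: responsibility of each grid point under the current Gaussian
        diff = pts - mu
        inv = np.linalg.inv(cov)
        dens = np.exp(-0.5 * np.sum(diff @ inv * diff, axis=1))
        resp = w * dens
        resp /= resp.sum()
        # M-step: reward-weighted mean and covariance
        mu = resp @ pts
        diff = pts - mu
        cov = (resp[:, None] * diff).T @ diff + 1e-6 * np.eye(2)
    return mu  # estimated (alpha, beta) near the reward maximum

if __name__ == "__main__":
    # Stand-in training function: a synthetic reward surface peaking at (0.3, 0.7)
    fake_train = lambda a, b: -((a - 0.3) ** 2 + (b - 0.7) ** 2)
    alphas = np.linspace(0.0, 1.0, 5)
    betas = np.linspace(0.0, 1.0, 5)
    R = sparse_reward_matrix(fake_train, alphas, betas)
    print(em_peak_estimate(R, alphas, betas))
```

In this sketch the expensive step is the grid of training runs; the surrogate fit itself is cheap, which is consistent with the abstract's claim that the EM-based estimation reduces computational complexity relative to denser hyper-parameter searches.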