Offline reinforcement learning (RL) has been widely used in practice due to its efficient data utilization, but it still suffers from training vulnerability caused by policy deviation. Existing offline RL methods that add policy constraints or perform conservative Q-value estimation are pessimistic, making the learned policy suboptimal. In this article, we address the pessimism problem by focusing on accurate Q-value estimation. We propose the de-pessimism (DEP) operator, which estimates Q-values using either the optimal Bellman operator or the compensation operator, depending on whether the actions lie in the behavior support set. The compensation operator qualitatively classifies out-of-distribution (OOD) actions as positive or negative based on their performance relative to the behavior actions. It leverages differences in state values to compensate for the Q-values of positive OOD actions, thereby alleviating pessimism. We theoretically demonstrate the convergence of DEP and its effectiveness in policy improvement. To further its practical application, we integrate DEP into the soft actor-critic (SAC) algorithm, yielding value-compensated de-pessimism offline RL (DoRL-VC). Experimentally, DoRL-VC achieves state-of-the-art (SOTA) performance across MuJoCo locomotion, Maze2D, and challenging Adroit tasks, illustrating the efficacy of DEP in mitigating pessimism.
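The abstract describes the operator-selection logic of DEP but not its formulas. The following minimal Python sketch illustrates one way such a target could be computed; all function names (`q_fn`, `v_beh_fn`, `v_pi_fn`, `in_support`) and the exact form of the compensation term are assumptions for illustration, not the paper's definitions.

```python
def dep_backup(r, s, a, a_beh, s_next, q_fn, v_beh_fn, v_pi_fn, in_support, gamma=0.99):
    """Illustrative DEP-style Bellman target (hypothetical signatures).

    q_fn(s, a)       -- current critic estimate
    v_beh_fn(s)      -- state value estimated under the behavior data
    v_pi_fn(s)       -- state value estimated under the learned policy
    in_support(s, a) -- whether (s, a) lies in the behavior support set
    a_beh            -- the behavior (dataset) action at state s
    """
    if in_support(s, a):
        # In-support action: standard (optimal) Bellman backup.
        return r + gamma * v_pi_fn(s_next)

    # OOD action: judge it against the behavior action's performance.
    if q_fn(s, a) >= q_fn(s, a_beh):
        # "Positive" OOD action: start from the pessimistic behavior-based target
        # and add a state-value gap as compensation, which restores a
        # non-pessimistic backup for actions that outperform the behavior policy.
        compensation = gamma * (v_pi_fn(s_next) - v_beh_fn(s_next))
        return r + gamma * v_beh_fn(s_next) + compensation

    # "Negative" OOD action: keep the pessimistic, behavior-based target.
    return r + gamma * v_beh_fn(s_next)
```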