Abstract

Deep reinforcement learning (DRL) has achieved remarkable milestones in artificial intelligence. However, the reward functions of most real-world tasks are delayed and sparse, which poses significant challenges for DRL methods. To tackle delayed and sparse rewards, many approaches that exploit prior knowledge from expert trajectories have been proposed, such as GAIL and its variants. However, when only suboptimal demonstrations are available, these methods usually struggle to surpass the demonstrators because of the complexity and fragility of adversarial training. To address these problems, this paper introduces a novel framework that combines Self-Imitation learning with Reward Relabeling based Reinforcement learning, dubbed SIR3. SIR3 accelerates online learning from suboptimal demonstrations even in environments with extremely sparse rewards, while still encouraging exploration of better policies. It devises a task-independent reward relabeling mechanism that generates reward signals for both expert examples and online experience, giving the agent more informative guidance even when very few suboptimal demonstrations are available. During training, the combination of imitation learning and RL losses lets the agent dynamically imitate rewarding trajectories, whether collected from experts or discovered through its own exploration. Experiments on widely used MuJoCo benchmarks show that SIR3 efficiently learns policies that surpass the suboptimal demonstrations, achieving better training efficiency and performance than state-of-the-art methods; notably, in some environments its advantage exceeds an order of magnitude.
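
The abstract names two mechanisms, reward relabeling and a combined imitation/RL loss, without detailing either. The Python sketch below illustrates one plausible reading under explicit assumptions: a constant-bonus relabeling rule applied to transitions from "rewarding" trajectories (expert demonstrations or good self-collected rollouts), and a training step that mixes a REINFORCE-style RL loss with a behavioral-cloning-style self-imitation term. The identifiers (relabel, train_step, lam), the relabeling rule, and the loss weighting are illustrative assumptions, not the paper's actual formulation.

    import torch
    import torch.nn as nn

    # Toy continuous-control actor (4-dim state, 2-dim action), standing in
    # for a MuJoCo policy network.
    policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
    opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

    def relabel(rewards, bonus=1.0):
        # Assumed relabeling rule: replace the sparse/delayed rewards of a
        # trajectory judged "rewarding" with a constant dense per-step bonus.
        return torch.full_like(rewards, bonus)

    def train_step(states, actions, rewards, demo_states, demo_actions, lam=0.5):
        # RL term: plain REINFORCE on relabeled reward-to-go (illustrative only).
        dist = torch.distributions.Normal(policy(states), 1.0)
        returns = relabel(rewards).flip(0).cumsum(0).flip(0)  # reward-to-go
        rl_loss = -(dist.log_prob(actions).sum(-1) * returns).mean()

        # Self-imitation term: regress the policy mean toward actions from
        # rewarding trajectories, whether expert-provided or self-explored.
        sil_loss = ((policy(demo_states) - demo_actions) ** 2).mean()

        loss = rl_loss + lam * sil_loss  # combined imitation + RL objective
        opt.zero_grad(); loss.backward(); opt.step()
        return loss.item()

    # Usage with random placeholder data:
    s, a, r = torch.randn(8, 4), torch.randn(8, 2), torch.zeros(8)
    train_step(s, a, r, demo_states=s[:4], demo_actions=a[:4])

The fixed-bonus rule is one way such a scheme can be task-independent: the surrogate signal does not depend on the environment's native reward scale, so the same relabeling applies unchanged across tasks.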
