Accelerating Lifelong Reinforcement Learning via Reshaping Rewards

Kun Chu, William Zhu, Xianchao Zhu

doi:10.1109/smc52423.2021.9659064

Abstract

The reinforcement learning (RL) problem is typically formalized as a Markov Decision Process (MDP), in which an agent interacts with an environment to maximize the long-term expected reward. As an important branch of RL, Lifelong RL requires the agent to solve a sequence of tasks one after another, each modeled as an MDP drawn from some distribution. A crucial issue in Lifelong RL is how best to use knowledge from previous tasks to improve performance on the current task. As a pioneering work in this field, MaxQInit takes the maximum over the action-values learned from previous tasks' environmental rewards as the initial action-value of the current task. In this way, MaxQInit improves the initial performance on the current task and reduces the sample complexity of learning. However, the rewards obtained during learning are usually delayed and sparse, which dramatically decreases learning efficiency. In this paper, we propose a new method, Shaping Rewards for Lifelong RL (SR-LLRL), which speeds up the lifelong learning process by shaping timely and informative rewards for each task. Specifically, we construct a Lifetime Reward Shaping (LRS) function based on knowledge of the optimal trajectories collected in previous tasks, which provides additional reward information for the current task. Compared with MaxQInit, our method exhibits higher learning efficiency and superior performance in Lifelong RL experiments.
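The two ingredients described in the abstract, MaxQInit-style initialization and reward shaping from earlier tasks' optimal trajectories, can be sketched for the tabular case as follows. This is a minimal sketch under stated assumptions, not the authors' implementation. Q-tables are assumed to be Python dicts keyed by (state, action). `max_q_init` follows the rule stated above (elementwise maximum over previous tasks' action-values). The shaping term `trajectory_bonus` is a hypothetical stand-in, not the paper's LRS function, whose exact form is not given in this abstract; it simply rewards state-action pairs in proportion to how often they appeared in previously collected optimal trajectories. All function names, the weight `beta`, and the Q-learning update are illustrative assumptions.

```python
from typing import Callable, Dict, Hashable, List, Optional, Sequence, Tuple

State = Hashable
Action = Hashable
QTable = Dict[Tuple[State, Action], float]


def max_q_init(previous_q: List[QTable], states: Sequence[State],
               actions: Sequence[Action], default: float = 0.0) -> QTable:
    """MaxQInit-style initialization: Q0(s, a) = max_i Q_i(s, a) over previous tasks.

    Pairs absent from every previous table fall back to `default`.
    """
    q0: QTable = {}
    for s in states:
        for a in actions:
            seen = [q[(s, a)] for q in previous_q if (s, a) in q]
            q0[(s, a)] = max(seen) if seen else default
    return q0


def trajectory_bonus(optimal_trajectories: List[List[Tuple[State, Action]]],
                     beta: float = 0.1) -> Callable[[State, Action], float]:
    """HYPOTHETICAL shaping term (not the paper's LRS function).

    Returns F(s, a) = beta * (relative frequency of (s, a) in the optimal
    trajectories collected from earlier tasks).
    """
    counts: Dict[Tuple[State, Action], int] = {}
    total = 0
    for traj in optimal_trajectories:
        for s, a in traj:
            counts[(s, a)] = counts.get((s, a), 0) + 1
            total += 1

    def bonus(s: State, a: Action) -> float:
        # No trajectories collected yet: no extra reward.
        return beta * counts.get((s, a), 0) / total if total else 0.0

    return bonus


def q_update(q: QTable, s: State, a: Action, r: float, s_next: State,
             actions: Sequence[Action], alpha: float = 0.1, gamma: float = 0.95,
             shaping: Optional[Callable[[State, Action], float]] = None) -> None:
    """One tabular Q-learning step (in place) on the shaped reward r + F(s, a)."""
    if shaping is not None:
        r = r + shaping(s, a)
    target = r + gamma * max(q[(s_next, b)] for b in actions)
    q[(s, a)] += alpha * (target - q[(s, a)])


# Usage on a toy two-state, two-action problem (illustrative numbers only).
states, actions = [0, 1], ["L", "R"]
q_task1 = {(0, "L"): 0.2, (0, "R"): 0.9, (1, "L"): 0.5, (1, "R"): 0.1}
q_task2 = {(0, "L"): 0.7, (0, "R"): 0.3, (1, "L"): 0.4, (1, "R"): 0.6}
q0 = max_q_init([q_task1, q_task2], states, actions)
bonus = trajectory_bonus([[(0, "R"), (1, "L")], [(0, "R"), (1, "R")]])
# Shaped update from state 0, action "R", environment reward 1.0, next state 1.
q_update(q0, 0, "R", r=1.0, s_next=1, actions=actions,
         alpha=0.5, gamma=0.9, shaping=bonus)
```

Here the initial table takes 0.7 for (0, "L") from the second task and 0.9 for (0, "R") from the first, and the shaped update adds a bonus of 0.05 for (0, "R") because that pair makes up half of the recorded optimal steps. Note that, unlike potential-based shaping, an arbitrary state-action bonus like this one is not guaranteed to preserve the optimal policy of the current task.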

Full Text