Abstract

Offline reinforcement learning (RL) enables learning policies from pre-collected datasets without online data collection. Although it offers the possibility of surpassing the performance of the datasets, most existing offline RL algorithms struggle to compete with behavior cloning policies in many dataset settings, because they must trade off policy improvement against the additional regularization needed to address distributional shift. In many cases, if one can imitate a sequence of sub-optimal sub-trajectories in the data and properly "stitch" them toward reaching an ideal future state, the result may be a more reliable policy that avoids the difficulties present in typical value-based offline RL algorithms. We borrow the idea of curriculum learning to embody the above intuition. We construct a curriculum that progressively imitates a sequence of sub-optimal trajectories conditioned on a series of carefully constructed future states and cumulative rewards as goals. The sub-optimal trajectories gradually guide policy learning toward reaching the ideal goal states. We name our algorithm Curriculum Goal-conditioned Imitation (CGI). Experimental results show that CGI achieves competitive performance against state-of-the-art offline RL algorithms, especially on challenging tasks with long horizons and sparse rewards.
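The following is a minimal sketch of the goal-conditioned imitation idea described above, assuming a PyTorch setup with continuous actions. The class and function names (GoalConditionedPolicy, sample_batch, train), the goal relabeling scheme, and the fixed horizon schedule are illustrative assumptions, not the paper's implementation; CGI constructs its sequence of goal states and cumulative rewards more carefully than this simple schedule.

# Sketch: goal-conditioned behavior cloning with a curriculum over goal horizons.
# Goals are relabeled from the dataset as (future state, reward accumulated
# until that state); later curriculum stages use longer horizons so goals
# move toward ideal final states. Names and schedule are illustrative only.
import torch
import torch.nn as nn

class GoalConditionedPolicy(nn.Module):
    """pi(a | s, g), where the goal g = (future state, cumulative reward)."""
    def __init__(self, state_dim, action_dim, goal_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state, goal):
        return self.net(torch.cat([state, goal], dim=-1))

def sample_batch(trajectories, horizon, batch_size=256):
    """Relabel goals from data: for state s_t, use the state `horizon` steps
    ahead plus the reward accumulated over those steps as the goal.
    Assumes each trajectory is a dict of tensors/lists longer than `horizon`."""
    states, actions, goals = [], [], []
    for _ in range(batch_size):
        traj = trajectories[torch.randint(len(trajectories), (1,)).item()]
        t = torch.randint(0, len(traj["obs"]) - horizon, (1,)).item()
        g_state = traj["obs"][t + horizon]
        g_return = float(sum(traj["rew"][t:t + horizon]))
        states.append(traj["obs"][t])
        actions.append(traj["act"][t])
        goals.append(torch.cat([g_state, torch.tensor([g_return])]))
    return torch.stack(states), torch.stack(actions), torch.stack(goals)

def train(policy, trajectories, horizons=(10, 50, 200), steps_per_stage=10_000):
    """Curriculum: imitate under short-horizon goals first, then progressively
    extend the horizon so relabeled goals approach the ideal goal states."""
    opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
    for horizon in horizons:                              # curriculum stages
        for _ in range(steps_per_stage):
            s, a, g = sample_batch(trajectories, horizon)
            loss = ((policy(s, g) - a) ** 2).mean()       # behavior-cloning MSE
            opt.zero_grad()
            loss.backward()
            opt.step()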
