Applying Reinforcement Learning (RL) to real-world visual tasks raises two major challenges: sample inefficiency and limited generalization. To address them, previous works focus primarily on learning semantic information from the visual state to improve sample efficiency, but they do not explicitly learn other valuable aspects, such as spatial information. Moreover, they improve generalization by learning representations that are invariant to changes in task-irrelevant variables, without considering task-relevant variables. To enhance the sample efficiency and generalization of the base RL algorithm in visual tasks, we propose an auxiliary task called Recovering Permuted Sequential Features (RPSF). Our method enhances generalization by learning the spatial structure information of the agent, which can mitigate the effects of changes in both task-relevant and task-irrelevant variables. It also explicitly learns both semantic and spatial information from the visual state by permuting and then recovering a sequence of features, which yields more holistic representations and thereby improves sample efficiency. Extensive experiments demonstrate that our method significantly improves the sample efficiency and generalization of the base RL algorithm and outperforms various state-of-the-art baselines across diverse tasks in unseen environments. Furthermore, our method is compatible with both CNN and Transformer architectures.
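The permute-then-recover idea behind RPSF can be illustrated with a toy sketch. The abstract does not specify the encoder, the recovery network, or the loss, so everything below is an assumption made for illustration: the function names (`permute_features`, `invert_permutation`, `recovery_loss`) are hypothetical, a mean squared error stands in for whatever objective the method actually uses, and an oracle inverse permutation stands in for the learned recovery module.

```python
import random


def permute_features(features, rng):
    """Shuffle a sequence of feature vectors; return the shuffled sequence and the permutation used."""
    perm = list(range(len(features)))
    rng.shuffle(perm)
    # Position `pos` of the shuffled sequence holds the feature originally at index perm[pos].
    shuffled = [features[i] for i in perm]
    return shuffled, perm


def invert_permutation(shuffled, perm):
    """Oracle recovery (assumption): put each shuffled feature back at its original index."""
    restored = [None] * len(shuffled)
    for pos, src in enumerate(perm):
        restored[src] = shuffled[pos]
    return restored


def recovery_loss(predicted, original):
    """Mean squared error between a predicted sequence and the original ordering (assumed loss)."""
    total = 0.0
    count = 0
    for p_vec, o_vec in zip(predicted, original):
        for p, o in zip(p_vec, o_vec):
            total += (p - o) ** 2
            count += 1
    return total / count


rng = random.Random(0)
# Six toy "patch features", each a 2-dimensional vector.
features = [[float(i), float(i) * 0.5] for i in range(6)]
shuffled, perm = permute_features(features, rng)
restored = invert_permutation(shuffled, perm)

loss_oracle = recovery_loss(restored, features)       # perfect recovery gives zero loss
loss_unrecovered = recovery_loss(shuffled, features)  # loss if no recovery is attempted
```

In an actual auxiliary task, a learned module would presumably receive only the shuffled sequence and be trained to reduce the recovery loss, which requires it to infer where each feature belongs spatially; the oracle here only shows what the training target looks like.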