CST-RL: Contrastive Spatio-Temporal Representations for Reinforcement Learning

Chi-Kai Ho, Chung-Ta King

doi:10.1109/access.2023.3258540

Abstract

Learning representations from high-dimensional observations is critical for training pixel-based continuous control tasks with reinforcement learning (RL). Without proper representations, training is very inefficient: learning directly from low-level pixel observations requires long training times and large amounts of training data, even though much of the information in such observations may be redundant or irrelevant. A common approach to this problem is to train auxiliary objectives alongside the main RL objective. The additional objectives provide more learning signals to the model and reduce training time, resulting in better sample efficiency. A representative work is Contrastive Unsupervised Representations for Reinforcement Learning (CURL), which leverages contrastive learning to help RL learn useful representations. Although CURL performs very well at extracting spatial information from pixel inputs, it is found to overlook potential temporal signals. This paper introduces CST-RL, a contrastive spatio-temporal representation learning framework for RL that leverages a 3D convolutional neural network (3D CNN) alongside contrastive learning for sample-efficient RL. It attends to both spatial and temporal signals in pixel observations. Experiments on DMControl show that CST-RL outperforms CURL in all six environments after 500K environment steps and needs only half of the steps to reach the standard score in the majority of cases.
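To make the contrastive auxiliary objective concrete, the following is a minimal sketch of an InfoNCE-style loss of the kind used in CURL-like methods, written in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function name `info_nce_loss`, the toy embeddings, and the identity default for the bilinear matrix `W` are all hypothetical, and the 3D CNN encoder that would produce the embeddings from stacked frames is not reproduced here.

```python
import math


def info_nce_loss(queries, keys, W=None):
    """Mean InfoNCE loss over a batch of embedding pairs.

    queries[i] and keys[i] are assumed to be embeddings of two augmented
    views of the same observation (the positive pair); every other key in
    the batch serves as a negative. Similarity is bilinear: q^T W k.
    W defaults to the identity matrix (an assumption made for this sketch).
    """
    n = len(queries)
    d = len(queries[0])
    if W is None:
        W = [[1.0 if r == c else 0.0 for c in range(d)] for r in range(d)]

    # Project every key once: proj[j] = W @ keys[j].
    proj = [[sum(W[r][c] * k[c] for c in range(d)) for r in range(d)]
            for k in keys]

    total = 0.0
    for i, q in enumerate(queries):
        # Similarity of query i to every key in the batch.
        logits = [sum(q[r] * p[r] for r in range(d)) for p in proj]
        # Numerically stable log-sum-exp for the softmax normalizer.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the positive key at index i.
        total += log_z - logits[i]
    return total / n


if __name__ == "__main__":
    q = [[5.0, 0.0], [0.0, 5.0]]
    aligned = info_nce_loss(q, [[5.0, 0.0], [0.0, 5.0]])
    shuffled = info_nce_loss(q, [[0.0, 5.0], [5.0, 0.0]])
    print(aligned < shuffled)
```

In this toy setting the loss is small when each query matches its own key and large when the keys are shuffled, which is the signal that pushes an encoder to map two views of the same observation close together.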

Full Text