Abstract
Learning representations from high-dimensional observations is critical for training pixel-based continuous control tasks with reinforcement learning (RL). Without proper representations, learning directly from low-level pixel observations is highly inefficient, requiring long training times and large amounts of data, much of which may be redundant or irrelevant. A common approach to this problem is to train auxiliary objectives alongside the main RL objective; the additional objectives provide more learning signal to the model and reduce training time, resulting in better sample efficiency. A representative work is Contrastive Unsupervised Representations for Reinforcement Learning (CURL), which leverages contrastive learning to help RL learn useful representations. Although CURL performs very well at extracting spatial information from pixel inputs, it overlooks potential temporal signals. This paper introduces CST-RL, a contrastive spatio-temporal representation learning framework for RL that leverages a 3D Convolutional Neural Network (3D CNN) alongside contrastive learning for sample-efficient RL, attending to both spatial and temporal signals in pixel observations. Experiments on DMControl show that CST-RL outperforms CURL in all six environments after 500K environment steps and needs only half as many steps to reach the standard score in the majority of cases.
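To make the idea concrete, below is a minimal sketch (not the authors' code) of the combination the abstract describes: a 3D-CNN encoder over a stack of frames, trained with a CURL-style contrastive (InfoNCE) objective between two augmented views of the same clip. The layer sizes, clip length, bilinear similarity, and the use of a separate key encoder are assumptions for illustration only.

```python
# Sketch of a 3D-CNN encoder plus a CURL-style contrastive loss.
# All architectural details here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Encoder3D(nn.Module):
    """Encodes a clip of stacked frames (B, C, T, H, W) into a latent vector,
    convolving over both space and time so temporal signals are preserved."""

    def __init__(self, in_channels=3, latent_dim=50):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(in_channels, 32, kernel_size=(3, 5, 5), stride=(1, 2, 2)), nn.ReLU(),
            nn.Conv3d(32, 64, kernel_size=(3, 3, 3), stride=(1, 2, 2)), nn.ReLU(),
            nn.Conv3d(64, 64, kernel_size=(1, 3, 3), stride=(1, 2, 2)), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),  # pool the remaining time and space dims
        )
        self.fc = nn.Linear(64, latent_dim)

    def forward(self, clip):
        h = self.conv(clip).flatten(1)
        return self.fc(h)


def curl_style_infonce(query, key, W):
    """CURL-style contrastive loss: the matching (query, key) pair is the
    positive, every other key in the batch is a negative, and similarity
    is the bilinear product q^T W k."""
    logits = query @ W @ key.t()                               # (B, B) similarities
    logits = logits - logits.max(dim=1, keepdim=True).values   # numerical stability
    labels = torch.arange(query.size(0), device=query.device)  # positives on the diagonal
    return F.cross_entropy(logits, labels)


if __name__ == "__main__":
    B, C, T, H, W_img = 8, 3, 8, 84, 84          # batch of 8-frame RGB clips (assumed shape)
    encoder_q = Encoder3D()                      # online (query) encoder
    encoder_k = Encoder3D()                      # key encoder, e.g. a momentum copy
    encoder_k.load_state_dict(encoder_q.state_dict())
    W = nn.Parameter(torch.eye(50))              # learned bilinear similarity matrix

    anchor = torch.rand(B, C, T, H, W_img)       # augmented view 1 of each clip
    positive = torch.rand(B, C, T, H, W_img)     # augmented view 2 of the same clip
    with torch.no_grad():
        key = encoder_k(positive)                # keys are not back-propagated through
    loss = curl_style_infonce(encoder_q(anchor), key, W)
    print(float(loss))
```

In practice this auxiliary loss would be optimized jointly with the RL objective (e.g. an actor-critic loss on the same latent features), which is what provides the extra learning signal the abstract refers to.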