Abstract

Most existing self-supervised methods learn video representations using a single pretext task. A single pretext task provides only one form of supervision from unlabeled data and may fail to capture the difference between spatial and temporal features. When the spatial and temporal features are too similar, it becomes difficult to distinguish between two similar videos with different class labels. In this paper, we propose an attentive spatial-temporal contrastive learning network (ASTCNet), which learns self-attentive spatial-temporal features through contrastive learning across multiple spatial and temporal pretext tasks. The spatial features are learned by multiple spatial pretext tasks, including spatial rotation and spatial jigsaw. Each spatial feature is enhanced with spatial self-attention by learning the relations between patches. The temporal features are learned by multiple temporal pretext tasks, including temporal order and temporal pace. Each temporal feature is enhanced with temporal self-attention by learning the relations between frames, and is further strengthened by feeding optical flow features into a motion module. To separate the spatial and temporal features learned from one video, we represent the video with a distinct feature for each pretext task and design a pretext task-based contrastive loss. This loss encourages different pretext tasks to learn dissimilar features and the same pretext task to learn similar features, so that discriminative features are learned for each pretext task within one video. Experiments show that our method achieves state-of-the-art performance for self-supervised action recognition on the UCF101 and HMDB51 datasets.
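To make the idea of the pretext task-based contrastive loss concrete, the following is a minimal PyTorch sketch based only on the description above: features of the same pretext task (taken from two views of one video) are pulled together, while features of different pretext tasks are pushed apart. The function name pretext_contrastive_loss, the temperature value, and the InfoNCE-style formulation are illustrative assumptions, not the paper's exact design.

```python
# Illustrative sketch, not the authors' implementation.
import torch
import torch.nn.functional as F

def pretext_contrastive_loss(feats_a, feats_b, temperature=0.1):
    """feats_a, feats_b: (num_tasks, dim) features of the same video under two views,
    one row per pretext task (e.g. rotation, jigsaw, order, pace)."""
    feats_a = F.normalize(feats_a, dim=1)
    feats_b = F.normalize(feats_b, dim=1)
    # Similarity of every task feature in view A to every task feature in view B.
    logits = feats_a @ feats_b.t() / temperature  # (num_tasks, num_tasks)
    # Positive pairs lie on the diagonal: the same pretext task across the two views.
    targets = torch.arange(feats_a.size(0), device=feats_a.device)
    # Cross-entropy treats the other pretext tasks as negatives, encouraging their
    # features to be dissimilar to the anchor task's feature.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Example: four pretext-task features (rotation, jigsaw, order, pace) of one video.
loss = pretext_contrastive_loss(torch.randn(4, 128), torch.randn(4, 128))
```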
