Abstract

Most existing self-supervised methods learn video representations using a single pretext task. A single pretext task provides only one form of supervision from unlabeled data and may fail to capture the difference between spatial and temporal features. When the spatial and temporal features are too similar, it becomes difficult to distinguish between two similar videos with different class labels. In this paper, we propose an attentive spatial-temporal contrastive learning network (ASTCNet), which learns self-attentive spatial-temporal features through contrastive learning across multiple spatial and temporal pretext tasks. The spatial features are learned by multiple spatial pretext tasks, including spatial rotation and spatial jigsaw. Each spatial feature is enhanced with spatial self-attention by learning the relations between patches. The temporal features are learned by multiple temporal pretext tasks, including temporal order and temporal pace. Each temporal feature is enhanced with temporal self-attention by learning the relations between frames, and is further strengthened by feeding optical flow features into a motion module. To separate the spatial and temporal features learned from one video, we represent the video with a distinct feature for each pretext task and design a pretext task-based contrastive loss. This loss encourages different pretext tasks to learn dissimilar features and the same pretext task to learn similar features, so that discriminative features are learned for each pretext task within one video. Experiments show that our method achieves state-of-the-art performance for self-supervised action recognition on the UCF101 and HMDB51 datasets.
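To make the idea of the pretext task-based contrastive loss concrete, the following is a minimal PyTorch sketch based only on the description above: features of the same pretext task (taken from two views of one video) are pulled together, while features of different pretext tasks are pushed apart. The function name pretext_contrastive_loss, the temperature value, and the InfoNCE-style formulation are illustrative assumptions, not the paper's exact design.

```python
# Illustrative sketch, not the authors' implementation.
import torch
import torch.nn.functional as F

def pretext_contrastive_loss(feats_a, feats_b, temperature=0.1):
    """feats_a, feats_b: (num_tasks, dim) features of the same video under two views,
    one row per pretext task (e.g. rotation, jigsaw, order, pace)."""
    feats_a = F.normalize(feats_a, dim=1)
    feats_b = F.normalize(feats_b, dim=1)
    # Similarity of every task feature in view A to every task feature in view B.
    logits = feats_a @ feats_b.t() / temperature  # (num_tasks, num_tasks)
    # Positive pairs lie on the diagonal: the same pretext task across the two views.
    targets = torch.arange(feats_a.size(0), device=feats_a.device)
    # Cross-entropy treats the other pretext tasks as negatives, encouraging their
    # features to be dissimilar to the anchor task's feature.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Example: four pretext-task features (rotation, jigsaw, order, pace) of one video.
loss = pretext_contrastive_loss(torch.randn(4, 128), torch.randn(4, 128))
```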
