Abstract

In this paper, we propose a self-supervised contrastive learning method for learning video feature representations. Traditional self-supervised contrastive learning methods train the model with constraints from anchor, positive, and negative data pairs: different samplings of the same video are treated as positives, while clips from different videos are treated as negatives. Because spatio-temporal information is important for video representation, we make the temporal constraints stricter by introducing intra-negative samples: in addition to clips from other videos, negatives are generated by breaking the temporal relations within clips from the anchor video itself. With the proposed Inter-Intra Contrastive (IIC) framework, we can train spatio-temporal convolutional networks to learn video feature representations. Strong data augmentations, residual clips, and a projection head are further utilized to construct an improved version, IICv2. We propose three kinds of intra-negative generation functions and conduct extensive experiments with different network backbones on benchmark datasets. Without using pre-computed optical flow, IICv2 outperforms the original IIC by a large margin in top-1 video retrieval accuracy: 19.4 points (from 36.8% to 56.2%) on UCF101 and 5.2 points (from 15.5% to 20.7%) on HMDB51. For video recognition, improvements of over 3 points are also obtained on both benchmarks. Discussions and visualizations confirm that IICv2 captures better temporal clues and shed light on its underlying mechanism.
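
To make the framework concrete, below is a minimal PyTorch sketch of the ideas described above; it is an illustrative assumption, not the authors' released implementation. It shows two representative intra-negative generation functions (frame shuffling and frame repeating; the paper proposes three), the residual-clip modality (frame differences, avoiding pre-computed optical flow), and an InfoNCE-style loss in which intra-negatives are appended to the inter-video negatives. All function names and the queue of inter-video negatives are hypothetical.

    # Hedged sketch of inter-intra contrastive learning; names are illustrative.
    import torch
    import torch.nn.functional as F

    def shuffle_frames(clip):
        # Intra-negative: permute a clip (C, T, H, W) along the time axis,
        # breaking temporal order while keeping appearance intact.
        perm = torch.randperm(clip.shape[1])
        return clip[:, perm]

    def repeat_frame(clip):
        # Intra-negative: repeat one randomly chosen frame over the whole
        # clip, removing all motion information.
        idx = torch.randint(clip.shape[1], (1,)).item()
        return clip[:, idx:idx + 1].expand_as(clip).contiguous()

    def residual_clip(clip):
        # Residual modality: frame-wise differences as a cheap motion cue.
        return clip[:, 1:] - clip[:, :-1]

    def inter_intra_nce(anchor, positive, intra_neg, inter_negs, temperature=0.07):
        # InfoNCE-style loss. anchor / positive / intra_neg: (N, D)
        # L2-normalized embeddings from the backbone plus projection head;
        # inter_negs: (K, D) embeddings of clips from other videos.
        pos = (anchor * positive).sum(dim=1, keepdim=True)          # (N, 1)
        neg_inter = anchor @ inter_negs.t()                         # (N, K)
        neg_intra = (anchor * intra_neg).sum(dim=1, keepdim=True)   # (N, 1)
        logits = torch.cat([pos, neg_inter, neg_intra], dim=1) / temperature
        # The positive similarity sits at index 0, so the target label is 0.
        labels = torch.zeros(anchor.shape[0], dtype=torch.long, device=anchor.device)
        return F.cross_entropy(logits, labels)

In practice, the anchor and positive would be two differently augmented samplings (or modalities, e.g., RGB and residual clips) of the same video, while the intra-negative is the anchor clip passed through one of the generators above.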

Highlights

  • Video understanding tasks require good feature representations from videos

  • We show that techniques such as the input data modality, data transformations, and a projection head are broadly effective in video self-supervised learning and can be applied to other methods in this area

  • Building on the Inter-Intra Contrastive (IIC) framework, we introduce several effective techniques and propose IICv2, an improved inter-intra self-supervised framework for video representation learning

Introduction

Video understanding tasks require good feature representations from videos. Tasks such as video segmentation, video summarization, and video retrieval rely on effective motion representation extractors, which are usually trained on the basis of video recognition. Many works explore different network architectures [1]–[6]. In addition to using RGB frames as input data, other works utilize optical flow as an additional data modality to form a two-stream model for better motion feature extraction [7]–[9], and better results can be achieved in this way [4]–[6]. Hara et al. [3] argued that video recognition can imitate image recognition procedures, meaning that performance can be significantly improved with larger datasets.
