Abstract
Instance-wise contrastive learning (Instance-CL), which learns to map similar instances closer together and dissimilar instances farther apart in the embedding space, has achieved considerable progress in self-supervised video representation learning. However, canonical Instance-CL does not properly handle the temporal similarities between different videos, limiting the representation capabilities of learned models. This paper presents a novel two-stage framework that combines Instance-CL and unsupervised clustering to progressively learn desirable temporal representations with high intra-class compactness. Specifically, (a) we first introduce a new consistency-preserving sampling strategy to generate positive/negative pairs. Compared to traditional sampling methods, our strategy focuses more on motion dynamics, yielding more temporally relevant feature representations. (b) To further exploit the temporal similarities between videos and thereby encourage intra-class compactness, we use the temporal representations learned by Instance-CL as an initialization and iteratively apply k-means clustering to generate pseudo-labels for training the encoder. We term our method Improved Instance-CL with Deep Clustering (ICDC) and apply it to two downstream tasks: action recognition and video retrieval. Extensive experimental results show that ICDC achieves considerable improvements over existing self-supervised methods.
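The second stage's pseudo-labeling loop can be sketched as follows. The abstract only states that k-means is run iteratively over encoder features to produce pseudo-labels, so everything below (function name, farthest-point initialization, toy embeddings) is an illustrative assumption, not the paper's implementation:

```python
def kmeans_pseudo_labels(embeddings, k, iters=10):
    """Assign each embedding a cluster id (pseudo-label) via plain k-means.

    `embeddings` stands in for features from a contrastively pre-trained
    encoder; in the full method these labels would supervise further
    encoder training.
    """
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Greedy farthest-point initialization keeps this sketch deterministic.
    centroids = [list(embeddings[0])]
    while len(centroids) < k:
        nxt = max(embeddings, key=lambda v: min(dist2(v, c) for c in centroids))
        centroids.append(list(nxt))

    labels = [0] * len(embeddings)
    for _ in range(iters):
        # Assignment step: the nearest centroid index is the pseudo-label.
        labels = [min(range(k), key=lambda c: dist2(v, centroids[c]))
                  for v in embeddings]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(embeddings, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Four toy "video embeddings" forming two well-separated groups.
feats = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
labels = kmeans_pseudo_labels(feats, k=2)
assert labels[0] == labels[1] and labels[2] == labels[3]
assert labels[0] != labels[2]
```

In practice one would run k-means over high-dimensional encoder outputs (e.g. with a library implementation) and alternate clustering with classification on the resulting pseudo-labels, as in deep-clustering approaches.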