Abstract
Instance-wise contrastive learning (Instance-CL), which learns to map similar instances closer together and dissimilar instances farther apart in the embedding space, has achieved considerable progress in self-supervised video representation learning. However, canonical Instance-CL does not properly handle the temporal similarities between different videos, limiting the representation capabilities of learned models. This paper presents a novel two-stage framework that combines Instance-CL and unsupervised clustering to progressively learn desirable temporal representations with high intra-class compactness. Specifically, (a) we first introduce a new consistency-preserving sampling strategy to generate positive/negative pairs. Compared to traditional sampling methods, our strategy focuses more on motion dynamics, yielding more temporally relevant feature representations. (b) To further exploit the temporal similarities between videos and thereby encourage intra-class compactness, we use the temporal representations learned by Instance-CL as an initialization and iteratively apply k-means clustering to generate pseudo-labels for training the encoder. We term our method Improved Instance-CL with Deep Clustering (ICDC) and apply it to two downstream tasks: action recognition and video retrieval. Extensive experimental results show that ICDC achieves considerable improvements over existing self-supervised methods.
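The second stage's pseudo-labeling loop can be sketched as follows. The abstract only states that k-means is run iteratively over encoder features to produce pseudo-labels, so everything below (function name, farthest-point initialization, toy embeddings) is an illustrative assumption, not the paper's implementation:

```python
def kmeans_pseudo_labels(embeddings, k, iters=10):
    """Assign each embedding a cluster id (pseudo-label) via plain k-means.

    `embeddings` stands in for features from a contrastively pre-trained
    encoder; in the full method these labels would supervise further
    encoder training.
    """
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Greedy farthest-point initialization keeps this sketch deterministic.
    centroids = [list(embeddings[0])]
    while len(centroids) < k:
        nxt = max(embeddings, key=lambda v: min(dist2(v, c) for c in centroids))
        centroids.append(list(nxt))

    labels = [0] * len(embeddings)
    for _ in range(iters):
        # Assignment step: the nearest centroid index is the pseudo-label.
        labels = [min(range(k), key=lambda c: dist2(v, centroids[c]))
                  for v in embeddings]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(embeddings, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Four toy "video embeddings" forming two well-separated groups.
feats = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
labels = kmeans_pseudo_labels(feats, k=2)
assert labels[0] == labels[1] and labels[2] == labels[3]
assert labels[0] != labels[2]
```

In practice one would run k-means over high-dimensional encoder outputs (e.g. with a library implementation) and alternate clustering with classification on the resulting pseudo-labels, as in deep-clustering approaches.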