Abstract

This paper introduces an online self-supervised method that leverages inter- and intra-level variance for video representation learning. Most existing methods focus on instance-level (inter-variance) encoding but ignore the intra-variance present within clips. The key observation behind our solution is the underlying correlation between the visual and audio streams: the distributions of their flow patterns in feature space are diverse, yet they express complementary, similar semantics. Moreover, in the semantic feature space, the horizontal dimension of the feature matrix can be regarded as cluster labels, and these cluster labels should be consistent across different modalities of the same video clip. Based on this idea, we propose an end-to-end inter-intra cross-modality contrastive clustering scheme that simultaneously optimizes the inter- and intra-level contrastive losses. Experiments show that the proposed approach considerably outperforms previous self-supervised methods on HMDB51 and UCF101 when applied to video retrieval and action recognition tasks.
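The core idea above can be sketched in code. The following is a minimal illustrative sketch, not the paper's exact loss: it assumes each modality encoder outputs an N-by-K feature matrix (N clips, K semantic dimensions), contrasts matching rows across modalities at the inter (instance) level, and contrasts matching columns of the softmax-normalized matrices at the intra (cluster) level, treating columns as soft cluster assignments. All function names and the temperature value are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=1):
    # Numerically stable row-wise softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def l2_normalize(x, axis=1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def info_nce(a, b, tau=0.5):
    # a, b: (M, D) L2-normalized; row i of a is the positive of row i of b.
    logits = (a @ b.T) / tau
    logits -= logits.max(axis=1, keepdims=True)  # stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

def inter_intra_loss(z_video, z_audio, tau=0.5):
    # Inter (instance) level: rows are per-clip embeddings;
    # pull matching video/audio clips together.
    inter = info_nce(l2_normalize(z_video), l2_normalize(z_audio), tau)
    # Intra (cluster) level: each softmax column is a soft cluster-
    # assignment vector over the batch; matching columns of the two
    # modalities should describe the same cluster.
    pv, pa = softmax(z_video), softmax(z_audio)
    intra = info_nce(l2_normalize(pv.T), l2_normalize(pa.T), tau)
    return inter + intra
```

In this sketch the two terms are simply summed; a weighting coefficient between the inter- and intra-level terms would be a natural hyperparameter.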
