Abstract
This paper introduces an online self-supervised method that leverages inter- and intra-level variance for video representation learning. Most existing methods focus on instance-level (inter-variance) encoding but ignore the intra-variance present within clips. The key observation behind our solution is the underlying correlation between the visual and audio modalities: the distribution of flow patterns in feature space is diverse, yet the modalities express complementary, similar semantics. Moreover, in the semantic feature space, the horizontal dimension of the feature matrix can be regarded as cluster labels, and these cluster labels should be consistent across different modalities of the same video clip. Based on this idea, we propose an end-to-end inter-intra cross-modality contrastive clustering scheme that simultaneously optimizes the inter- and intra-level contrastive losses. Experiments show that our approach considerably outperforms previous self-supervised methods on HMDB51 and UCF101 when applied to video retrieval and action recognition tasks.
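To make the inter-intra idea concrete, the following is a minimal PyTorch sketch of a cross-modal contrastive clustering objective of the kind described above: rows of each projected feature matrix are instance embeddings (inter-level contrast), while columns of the soft cluster-assignment matrix are treated as cluster representations to be aligned across modalities (intra-level contrast). All names here (the `cluster_head` projection, `temperature`, the encoder outputs) are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.1):
    """Symmetric InfoNCE between two row-aligned sets of vectors."""
    a = F.normalize(a, dim=1)
    b = F.normalize(b, dim=1)
    logits = a @ b.t() / temperature                       # pairwise similarities
    targets = torch.arange(a.size(0), device=a.device)     # matching rows are positives
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def inter_intra_loss(video_feats, audio_feats, cluster_head, temperature=0.1):
    """
    video_feats, audio_feats: (B, D) embeddings of the same B clips from two modalities.
    cluster_head: nn.Linear(D, K) mapping features to K soft cluster assignments (assumed).
    """
    # Inter-level (instance) contrast: each clip's video and audio embeddings should match.
    inter = info_nce(video_feats, audio_feats, temperature)

    # Soft cluster-assignment matrices of shape (B, K); the k-th column describes how
    # strongly each clip in the batch belongs to cluster k.
    pv = F.softmax(cluster_head(video_feats), dim=1)
    pa = F.softmax(cluster_head(audio_feats), dim=1)

    # Intra-level (cluster) contrast: the k-th column should agree across modalities,
    # i.e. cluster labels are consistent for different modalities of the same clips.
    intra = info_nce(pv.t(), pa.t(), temperature)

    return inter + intra
```

In this sketch the two losses are simply summed; how the paper weights or schedules them, and which backbone encoders produce `video_feats` and `audio_feats`, is not specified in the abstract.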