Abstract

In this study, we propose a novel pretext task and a self-supervised motion perception (SMP) method for spatiotemporal representation learning. The pretext task is defined as video playback rate perception, which uses dilated temporal sampling to augment each video clip into multiple duplicates at different temporal resolutions. The SMP method is built upon discriminative and generative motion perception models, which collaboratively capture representations of motion dynamics and appearance from video clips at multiple temporal resolutions. To strengthen this collaboration, we further propose difference- and convolution-based motion attention (MA), which drives the generative model to focus on motion-related appearance, and leverage multiple-granularity perception (MG) to extract accurate motion dynamics. Extensive experiments demonstrate the effectiveness of SMP for video motion perception and the state-of-the-art performance of the learned self-supervised representations on downstream tasks, including action recognition and video retrieval. Code for SMP is available at github.com/yuanyao366/SMP.
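
To make the dilated temporal sampling step concrete, the sketch below builds several duplicates of one clip by sampling every d-th frame, so that each dilation d corresponds to one playback-rate class. This is a minimal illustration under our own assumptions: the function name dilated_sample, the default dilation set, and the (T, C, H, W) tensor layout are hypothetical, not the authors' exact implementation (see the repository above for that).

    import torch

    def dilated_sample(video, clip_len=16, dilations=(1, 2, 4, 8)):
        # video: (T, C, H, W) frame tensor; layout assumed for illustration.
        clips = []
        for d in dilations:
            span = clip_len * d  # frames covered at dilation d
            assert video.shape[0] >= span, "video too short for this dilation"
            # Random temporal crop, then take every d-th frame within it.
            start = torch.randint(0, video.shape[0] - span + 1, (1,)).item()
            idx = torch.arange(start, start + span, d)
            clips.append(video[idx])  # (clip_len, C, H, W)
        return clips  # one fixed-length clip per playback-rate class

In a setup like the one the abstract describes, a discriminative head would then classify which dilation produced each clip, while a generative head reconstructs appearance at the original temporal resolution.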
