Abstract

We address the problem of learning representations from videos without manual annotation. Different video clips sampled from the same video usually share a similar background and exhibit consistent motion. We design a novel self-supervised task to learn this temporal coherence, which we measure with mutual information. First, we maximize the mutual information between features extracted from clips sampled from the same video. This encourages the network to learn the content shared by these clips. As a result, the network may focus on the background and ignore the motion, because different clips from the same video normally share the same background. Second, to address this issue, we simultaneously maximize the mutual information between the feature of a video clip and the local regions where salient motion exists. Our approach, referred to as Deep Video Infomax (DVIM), strikes a balance between background and motion when learning temporal coherence. We conduct extensive experiments to evaluate the proposed DVIM on various tasks. Fine-tuning results on high-level action recognition problems validate the effectiveness of the learned representations. Additional experiments on action similarity labeling further demonstrate the generalization of the representations learned by DVIM.
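
The following is a minimal PyTorch sketch of the two mutual-information objectives described above, not the paper's exact method: the encoder architecture, the `infonce` estimator (an InfoNCE-style lower bound on mutual information), the temperature value, and the motion-saliency weighting are all illustrative assumptions.

```python
# Hedged sketch of a DVIM-style training objective.
# Assumptions (not from the paper): a tiny 3D-conv encoder, an InfoNCE bound
# on mutual information, and a user-supplied motion-saliency map
# (e.g. computed from frame differences).
import torch
import torch.nn as nn
import torch.nn.functional as F


class VideoEncoder(nn.Module):
    """Toy 3D-conv encoder producing a local feature map and a global vector."""

    def __init__(self, dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(64, dim, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, clip):                      # clip: (B, 3, T, H, W)
        local = self.backbone(clip)               # (B, D, T', H', W') local features
        glob = local.mean(dim=(2, 3, 4))          # (B, D) global clip feature
        return local, glob


def infonce(query, keys, temperature=0.07):
    """InfoNCE lower bound on MI; positives are the matching batch indices."""
    logits = query @ keys.t() / temperature       # (B, B) similarity matrix
    targets = torch.arange(query.size(0), device=query.device)
    return F.cross_entropy(logits, targets)


def dvim_style_loss(encoder, clip_a, clip_b, motion_weight):
    """clip_a, clip_b: two clips sampled from the same video.
    motion_weight: (B, T', H', W') saliency map highlighting salient motion."""
    local_a, glob_a = encoder(clip_a)
    _, glob_b = encoder(clip_b)

    # (1) Global term: maximize MI between features of clips from the same video.
    loss_global = infonce(F.normalize(glob_a, dim=1), F.normalize(glob_b, dim=1))

    # (2) Local term: pool clip A's local features with the motion-saliency
    # weights to obtain a motion-focused feature, then maximize its MI with
    # clip B's global feature, so the encoder cannot rely on background alone.
    local_flat = local_a.flatten(2)                            # (B, D, N)
    w = motion_weight.flatten(1)                               # (B, N)
    w = w / (w.sum(dim=1, keepdim=True) + 1e-6)
    motion_feat = torch.einsum('bdn,bn->bd', local_flat, w)    # (B, D)
    loss_local = infonce(F.normalize(glob_b, dim=1),
                         F.normalize(motion_feat, dim=1))

    return loss_global + loss_local


# Usage with random tensors (batch of 4, 8-frame 64x64 clips); the saliency
# map must match the encoder's local grid, here (T', H', W') = (2, 16, 16).
enc = VideoEncoder()
a, b = torch.randn(4, 3, 8, 64, 64), torch.randn(4, 3, 8, 64, 64)
saliency = torch.rand(4, 2, 16, 16)
print(dvim_style_loss(enc, a, b, saliency))
```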
