Abstract

This work addresses self-supervised learning of video representations. Prior work constructs different surrogate supervision signals from the data itself. Rather than proposing yet another signal, our main insight is that self-supervised learning can benefit from mutual learning: these supervision signals can learn from one another, and combining them leads to better representations. Building on this insight, we present Self-supervised Mutual (SSM) Learning, a simple framework for mutual learning of video representations in a self-supervised setting. To understand what enables the task to learn useful representations, we systematically study the major components of our framework. We show that (1) a surrogate supervision signal can learn effectively from others under the mutual-learning framework, and (2) introducing a learnable align unit between the deep features supervised by the different signals in the hidden space improves the quality of the learned representation. By combining these findings, we considerably outperform previous self-supervised learning methods on HMDB51 and UCF101 when the representations are applied to action recognition.
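
The abstract does not give implementation details, but a minimal sketch may help illustrate the two findings above: surrogate signals trained jointly, coupled by a learnable align unit between their hidden features. Everything below is an assumption for illustration, not the paper's actual method: the backbone, the two example pretext tasks (clip-order and playback-speed prediction), the module names, and the equal loss weighting are all placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sketch only: backbone, pretext tasks, and loss weights are
# assumptions, chosen to show how two surrogate signals could be trained
# jointly with a learnable align unit between their hidden features.


class AlignUnit(nn.Module):
    """Learnable mapping between the hidden features of two pretext branches."""

    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, feat_a, feat_b):
        # Map branch A's features into branch B's feature space and
        # penalize the remaining discrepancy.
        return F.mse_loss(self.proj(feat_a), feat_b)


class SSMSketch(nn.Module):
    def __init__(self, feat_dim=512, n_order_classes=4, n_speed_classes=3):
        super().__init__()
        # Shared video encoder (placeholder; a 3D CNN would be used in practice).
        self.encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim), nn.ReLU())
        # Two example surrogate tasks: clip-order and playback-speed prediction.
        self.order_head = nn.Linear(feat_dim, n_order_classes)
        self.speed_head = nn.Linear(feat_dim, n_speed_classes)
        self.align = AlignUnit(feat_dim)

    def forward(self, clips_a, clips_b, order_labels, speed_labels):
        feat_a = self.encoder(clips_a)  # features supervised by the order signal
        feat_b = self.encoder(clips_b)  # features supervised by the speed signal
        loss_order = F.cross_entropy(self.order_head(feat_a), order_labels)
        loss_speed = F.cross_entropy(self.speed_head(feat_b), speed_labels)
        loss_align = self.align(feat_a, feat_b)
        # Equal weighting is an arbitrary choice for this sketch.
        return loss_order + loss_speed + loss_align


if __name__ == "__main__":
    model = SSMSketch()
    clips_a = torch.randn(8, 3, 16, 32, 32)  # (batch, C, T, H, W) dummy clips
    clips_b = torch.randn(8, 3, 16, 32, 32)
    order_y = torch.randint(0, 4, (8,))
    speed_y = torch.randint(0, 3, (8,))
    loss = model(clips_a, clips_b, order_y, speed_y)
    loss.backward()
    print(float(loss))
```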
