Abstract

We propose a self-supervised learning method that uncovers the spatial or temporal structure of visual data by identifying the position of a patch within an image or of a frame within a video, a task related to the Jigsaw puzzle reassembly problem studied in previous work. A Jigsaw puzzle can be seen as a shuffled sequence, generated by shuffling image patches or video frames according to an unknown permutation. Predicting this visual permutation trains a learning system to capture structural information that is important for semantic-level tasks such as object recognition and action recognition. To this end, we propose a multi-task learning framework in which a group of principal tasks predicts the index of each sample in the original sequence, and a group of auxiliary tasks predicts the spatial or temporal relation of adjacent samples in the shuffled sequence. Our scheme handles the whole space of permutations, scales well, and generalizes to problems such as self-supervised representation learning, relative attributes, and learning to rank. Our method achieves state-of-the-art performance on the STL-10 benchmark for unsupervised representation learning, and it is competitive with the state of the art on UCF-101 and HMDB-51 as a pretraining method for action recognition. In addition, we apply the proposed method to an age comparison task to demonstrate its generality for ranking problems.
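The two groups of targets described above can be illustrated with a minimal sketch. The function below is a hypothetical construction (not the paper's exact formulation): it shuffles a sequence of n patches or frames, then derives the principal labels (each shuffled sample's original index) and the auxiliary labels (the relative order of adjacent samples in the shuffled sequence).

```python
import random

def make_permutation_targets(n, rng=random):
    """Illustrative sketch, not the paper's exact formulation.

    Given a sequence of n patches/frames, build the shuffled order
    and the two groups of training targets the abstract describes.
    """
    perm = list(range(n))
    rng.shuffle(perm)  # perm[i] = original index of the i-th shuffled sample

    # Principal tasks: predict each shuffled sample's original index.
    principal = list(perm)

    # Auxiliary tasks: for each adjacent pair in the shuffled sequence,
    # predict their relative order in the original sequence
    # (+1 if the left sample originally came first, -1 otherwise).
    auxiliary = [1 if perm[i] < perm[i + 1] else -1 for i in range(n - 1)]
    return perm, principal, auxiliary
```

In a real pipeline the principal labels would supervise one classification head per position, while the auxiliary labels would supervise pairwise heads over adjacent samples; both heads share the same backbone features.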
