We present a principled approach to uncover the structure of visual data by solving a deep learning task coined visual permutation learning. The goal of this task is to find the permutation that recovers the structure of data from shuffled versions of it. In the case of natural images, this task boils down to recovering the original image from patches shuffled by an unknown permutation matrix. Permutation matrices are discrete, thereby posing difficulties for gradient-based optimization methods. To this end, we resort to a continuous approximation using doubly-stochastic matrices and formulate a novel bi-level optimization problem on such matrices that learns to recover the permutation. Unfortunately, such a scheme leads to expensive gradient computations. We circumvent this issue by further proposing a computationally cheap scheme for generating doubly stochastic matrices based on Sinkhorn iterations. To implement our approach we propose DeepPermNet, an end-to-end CNN model for this task. The utility of DeepPermNet is demonstrated on three challenging computer vision problems, namely, relative attributes learning, supervised learning-to-rank, and self-supervised representation learning. Our results show state-of-the-art performance on the Public Figures and OSR benchmarks for relative attributes learning, chronological and interestingness image ranking for supervised learning-to-rank, and competitive results in the classification and segmentation tasks of the PASCAL VOC dataset for self-supervised representation learning.
Read full abstract