Abstract

The success of deep learning in computer vision is rooted in the ability of deep networks to scale up model complexity as demanded by challenging visual tasks. As complexity increases, so does the demand for large amounts of labeled data to train the model, which translates into a costly human annotation effort. Modern vision networks often rely on a two-stage training process to satisfy this thirst for training data: the first stage, pretraining, is done on a general vision task for which a large collection of annotated data is available. This primes the network with semantic knowledge that is general to a wide variety of vision tasks. The second stage, fine-tuning, continues training the network, this time on the target task, where annotations are often scarce. The reliance on supervised pretraining anchors future progress to a constant human annotation effort, especially for new or ever-changing domains. To address this concern, with the long-term goal of leveraging the abundance of cheap unlabeled data, we explore methods of unsupervised pretraining. In particular, we propose to use self-supervised automatic image colorization.

We begin by evaluating two baselines for leveraging unlabeled data for representation learning. The first trains a mixture model for each layer in a greedy manner. We show that this method excels on relatively simple tasks in the small-sample regime, and that it can also produce a well-organized feature space that is equivariant to cyclic transformations, such as rotation. The second baseline is the autoencoder, which is trained end-to-end and thus avoids the main concerns of greedy training. However, its per-pixel loss is a poor analog to perceptual similarity, and the representation suffers as a consequence. Both methods leave a wide gap between unsupervised and supervised pretraining.

As a precursor to our improvements in unsupervised representation learning, we develop a novel method for automatic colorization of grayscale images and focus initially on its use as a graphics application. We set a new state of the art that handles a wide variety of scenes and contexts. Our method makes it possible to revitalize old black-and-white photography without requiring human effort or expertise. In order for the model to appropriately re-color a grayscale object, it must first be able to identify it. Since such high-level semantic knowledge benefits colorization, we found success applying the two-stage training process with supervised pretraining. This raises the question: if colorization and classification both benefit from the same visual semantics, can we reverse the relationship and use colorization to benefit classification?

Using colorization as a pretraining method requires no data annotations, since labeled training pairs are constructed automatically by separating intensity and color (a minimal sketch of this construction appears below). Such a task is called self-supervised. Colorization joins a growing family of self-supervision methods as a front-runner with state-of-the-art results. We show that up to a certain sample size, labeled data can be entirely replaced by a large collection of unlabeled data. If these techniques continue to improve, they may one day supplant supervised pretraining altogether. We provide a significant step toward this goal.

As a future direction for self-supervision, we investigate whether multiple proxy tasks can be combined to improve generalization in the representation. We explore a wide range of combination methods, both offline methods that fuse or distill already-trained networks and online methods that actively train multiple tasks together. In controlled experiments, we demonstrate significant gains using both offline and online methods. However, the benefits do not translate to self-supervised pretraining, leaving the question of multi-proxy self-supervision an open and interesting problem.
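To make the intensity/color separation concrete, the following is a minimal sketch of how a colorization training pair can be built from an unlabeled RGB image. It assumes NumPy and scikit-image for the color-space conversion; the choice of the Lab color space and the function name are illustrative, not a prescription of the method described above.

    # Minimal sketch: build a self-supervised colorization pair by splitting
    # an image into an intensity input and a color target. No human labels
    # are involved. Assumes scikit-image and NumPy; names are illustrative.
    import numpy as np
    from skimage.color import rgb2lab

    def make_colorization_pair(rgb_image: np.ndarray):
        """Split an RGB image (H x W x 3, values in [0, 1]) into a
        grayscale input and a color target."""
        lab = rgb2lab(rgb_image)      # convert to the Lab color space
        intensity = lab[..., :1]      # L channel: what the network sees
        color = lab[..., 1:]          # a, b channels: what it must predict
        return intensity, color

In such a setup, the intensity channel serves as the network input and the color channels as the prediction target during pretraining; the learned features can then be fine-tuned on a labeled target task in the usual two-stage fashion.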
