Abstract

The number of 360-degree panoramas shared online has been rapidly increasing due to the availability of affordable and compact omnidirectional cameras, offering a vast amount of new visual information that was previously unavailable. In this paper, we present the first work to exploit unlabeled 360-degree data for image representation learning. We propose middle-out, a new self-supervised learning task that leverages the spatial configuration of normal field-of-view images sampled from a 360-degree image as a supervisory signal. We train a Siamese ConvNet model to identify the middle image among three shuffled images sampled from a panorama by perspective projection. Compared to previous self-supervised methods that train models using image patches or video frames with a limited field of view, our method leverages the rich semantic information contained in 360-degree images and forces the model not only to learn about objects, but also to develop a higher-level understanding of object relationships and scene structure. We quantitatively demonstrate that the feature representation learned with the proposed task is useful for a wide range of vision tasks, including object classification, object detection, scene classification, semantic segmentation, and geometry estimation. We also qualitatively show that the proposed method leads the ConvNet to extract high-level semantic concepts, an ability that previous self-supervised learning methods have not acquired.
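
The abstract describes the middle-out pretext task only at a high level. The sketch below illustrates one plausible way such a task could be set up: a Siamese ConvNet with shared weights encodes three shuffled normal field-of-view crops from a panorama, and a classifier predicts which input slot holds the spatially middle crop. The backbone, layer sizes, and 3-way head are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class MiddleOutNet(nn.Module):
    """Hypothetical sketch of the middle-out task: given three shuffled
    crops from one panorama, predict which slot is the middle view."""

    def __init__(self, feat_dim=256):
        super().__init__()
        # Shared (Siamese) feature extractor applied to each crop;
        # the real paper's backbone is likely deeper.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2, padding=1),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(128, feat_dim),
        )
        # Head over the concatenated features: 3-way classification over
        # which shuffled slot contains the middle crop (an assumption
        # about how the task is posed).
        self.head = nn.Linear(3 * feat_dim, 3)

    def forward(self, crops):
        # crops: (batch, 3, 3, H, W) -- three RGB crops per panorama.
        feats = [self.backbone(crops[:, i]) for i in range(3)]
        return self.head(torch.cat(feats, dim=1))


if __name__ == "__main__":
    model = MiddleOutNet()
    dummy = torch.randn(2, 3, 3, 224, 224)   # 2 panoramas, 3 crops each
    labels = torch.tensor([0, 2])             # slot index of the middle crop
    loss = nn.CrossEntropyLoss()(model(dummy), labels)
    print(loss.item())
```

Because the supervisory signal comes only from the spatial ordering of crops sampled from the same panorama, no manual labels are needed; this is what makes the task self-supervised.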
