Abstract

We propose a dual system for unsupervised object segmentation in video, which brings together two modules with complementary properties: a space-time graph that discovers objects in videos and a deep network that learns powerful object features. The system uses an iterative knowledge exchange policy. A novel spectral space-time clustering process on the graph produces unsupervised segmentation masks, which are passed to the network as pseudo-labels. The network learns to segment in single frames what the graph discovers in video, and passes back to the graph strong image-level features that improve the graph's node-level features in the next iteration. Knowledge is exchanged for several cycles until convergence. The graph has one node per video pixel, yet object discovery is fast: a novel power iteration algorithm computes the main space-time cluster as the principal eigenvector of a special Feature-Motion matrix without ever building the matrix explicitly. A thorough experimental analysis supports our theoretical claims and demonstrates the effectiveness of the cyclical knowledge exchange. We also perform experiments in the supervised scenario, incorporating features pretrained with human supervision. We achieve state-of-the-art-level results in both the unsupervised and supervised scenarios on four challenging datasets: DAVIS, SegTrack, YouTube-Objects, and DAVSOD. We will make our code publicly available.

Highlights

  • Discovering objects, without human supervision, as they move and change appearance over space and time is one of the most challenging and still unsolved problems in computer vision

  • In order to compute this primary cluster fast, we introduce a novel algorithm for spectral clustering, which finds the cluster on the principal eigenvector of the Feature-Motion matrix, without explicitly going through the expensive process of building the matrix (Sec. 4.3)

  • The main contributions of our work are: (1) We introduce a dual Iterative Knowledge Exchange (IKE) system (Sec. 3, Fig. 1), which puts together a space-time graph and a deep neural network, to learn, as a single self-supervised entity
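The idea of finding a principal eigenvector without building the matrix can be illustrated with a minimal sketch. It does not reproduce the paper's Feature-Motion matrix, whose definition is not given in this excerpt. As a stand-in assumption, it uses an implicit matrix M = F Fᵀ, where F is a hypothetical non-negative array of per-node features, and evaluates each product M x as F (Fᵀ x), so the n × n matrix is never stored. The function name and parameters are illustrative only.

```python
import numpy as np


def matrix_free_power_iteration(features, num_iters=100, tol=1e-9, seed=0):
    """Approximate the principal eigenvector of M = F @ F.T without forming M.

    `features` is a hypothetical (n_nodes, n_dims) array F. This is a generic
    stand-in, not the Feature-Motion matrix defined in the paper.
    """
    rng = np.random.default_rng(seed)
    n_nodes = features.shape[0]

    # Start from a random non-negative unit vector.
    x = rng.random(n_nodes)
    x /= np.linalg.norm(x)

    for _ in range(num_iters):
        # M @ x computed as F @ (F.T @ x): O(n_nodes * n_dims) time and memory,
        # instead of O(n_nodes ** 2) for the explicit matrix.
        y = features @ (features.T @ x)
        norm = np.linalg.norm(y)
        if norm == 0.0:
            break  # x lies in the null space of M; nothing more to do
        y /= norm
        converged = np.linalg.norm(y - x) < tol
        x = y
        if converged:
            break
    return x
```

With non-negative features, the entries of the returned vector are non-negative and can be read as soft membership scores of each node in the dominant cluster; thresholding them would give a hard mask under this simplified model.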


Summary

Introduction

Discovering objects, without human supervision, as they move and change appearance over space and time is one of the most challenging and still unsolved problems in computer vision. One of the main questions we ask is how best to exploit the correlation between objects' motion and appearance in order to model mathematically the process of object discovery in the absence of human supervision. By solving this task, we could learn more efficiently from the large quantities of data available in the space-time domain, with minimal human intervention. While visual grouping has been studied since the early Gestalt School of psychology [1], we have not yet found the underlying computational model that could best describe such unsupervised discovery. In the context of unsupervised segmentation in video, we bring face-to-face two important computational learning worlds: that of deep learning, with strong supervised learning abilities, and that of iterative graph algorithms, with proven unsupervised clustering strengths.

