Abstract

In this chapter, we approach the problem of unsupervised learning in space and time head-on, considering the spatial and temporal dimensions together from the very beginning. We couple the appearance of objects, which defines their spatial properties, with their motion, which defines their existence in time, and provide a unique graph clustering formulation in both space and time for the problem of unsupervised object discovery in video. Again we resort to a graph formulation, with one-to-one correspondences between graph nodes and video pixels. Graph nodes that belong to the main object of interest should form a strong cluster: they are linked through trajectories defined by pixels that belong to the same physical object point, and they should also have similar appearance features along those trajectories. The clustering problem aims to maximize both their agreement in motion and their similarity in appearance. We therefore regard object discovery as a graph clustering problem, for which we provide an efficient spectral approach: the object segmentation in the space-time volume is captured by the principal eigenvector of a Feature-Motion matrix, which we introduce for the first time. Even though the matrix is huge, we never build it explicitly; nevertheless, our segmentation algorithm is guaranteed to converge to its principal eigenvector. The principal eigenvector is the optimal solution to an eigenvalue problem, so the result does not depend on initialization. For this reason, the method is very robust in practice and is able to find the foreground object in the video as the strongest cluster in the space-time graph. The approach in this chapter relates naturally to the principles of unsupervised learning presented in the introduction and also provides a first, lower-level approach to coupling space and time processing. This coupling is further explored at a higher semantic level in the final chapter, where we present our recurrent space-time graph neural network model (Nicolicioiu et al., Neural Information Processing Systems, [1]). In practice, our method, termed GO-VOS, achieves state-of-the-art results on three difficult datasets in the literature: DAVIS, SegTrack, and YouTube-Objects.
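
The idea of reaching a principal eigenvector without ever forming the matrix can be illustrated with a generic matrix-free power iteration. The sketch below is not the GO-VOS algorithm and does not construct the Feature-Motion matrix. It assumes a toy non-negative matrix of the form M = F Fᵀ, where `F`, `implicit_matvec`, and `power_iteration` are hypothetical names chosen for illustration. It only shows that repeated matrix-vector products suffice, and that two different random starting points reach the same eigenvector.

```python
import numpy as np


def power_iteration(matvec, n, num_iters=500, tol=1e-10, seed=0):
    """Approximate the principal eigenvector using only matrix-vector products."""
    rng = np.random.default_rng(seed)
    x = rng.random(n) + 1e-3          # strictly positive random start
    x /= np.linalg.norm(x)
    for _ in range(num_iters):
        y = matvec(x)                 # the matrix itself is never needed
        norm = np.linalg.norm(y)
        if norm == 0.0:
            raise ValueError("matvec returned the zero vector")
        y /= norm
        converged = np.linalg.norm(y - x) < tol
        x = y
        if converged:
            break
    return x


# Toy stand-in (an assumption, not the chapter's matrix): M = F @ F.T,
# with F holding one non-negative feature row per graph node.
n_nodes, n_feats = 60, 4
F = np.random.default_rng(42).random((n_nodes, n_feats))


def implicit_matvec(x):
    # Computes (F @ F.T) @ x without forming the n_nodes x n_nodes matrix.
    return F @ (F.T @ x)


# Two different initializations.
x_a = power_iteration(implicit_matvec, n_nodes, seed=0)
x_b = power_iteration(implicit_matvec, n_nodes, seed=123)

# Dense reference, feasible only because this toy problem is tiny.
M = F @ F.T
_, eigvecs = np.linalg.eigh(M)        # eigenvalues are returned in ascending order
v_top = eigvecs[:, -1]

print("agreement between two random starts:", float(x_a @ x_b))
print("|cosine| with dense reference:", float(abs(x_a @ v_top)))
```

In this toy setting the implicit product costs O(n·d) per iteration instead of the O(n²) needed for the dense matrix, which is the kind of saving that matters when there is one node per video pixel.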
