Subspace clustering is a popular method for discovering the underlying low-dimensional structures of high-dimensional multimedia data (e.g., images, videos, and texts). In this article, we consider the large-scale subspace clustering (LS2C) problem, that is, partitioning a million data points with a million dimensions. To address it, we explore a distributed, parallel framework with independent subproblems, obtained by dividing the large data and variable matrices and the regularization by both columns and rows. Specifically, LS2C decomposes into many independent subproblems when those matrices are distributed across machines by columns, because the regularization of the code matrix equals the sum of the regularization of its submatrices (e.g., for the squared Frobenius norm or the l1-norm). A consensus optimization scheme then solves these subproblems in parallel to reduce communication costs. Moreover, we provide theoretical guarantees that LS2C can recover consensus subspace representations of high-dimensional data points under broad conditions. Compared with state-of-the-art LS2C methods, our approach achieves better clustering results on public datasets, including datasets with a million images and videos.
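The column-wise decomposition described above can be illustrated with a minimal NumPy sketch. It is not the authors' algorithm: it assumes a simple ridge-regularized self-expression objective, min ||X - XC||_F^2 + lam * ||C||_F^2, and leaves out the row-wise split, the consensus step, and any constraint on the diagonal of C. The names `column_blocks`, `solve_block`, and `lam` are hypothetical, chosen only for illustration.

```python
import numpy as np

def column_blocks(matrix, num_blocks):
    """Split a matrix into contiguous column blocks (one per hypothetical worker)."""
    return np.array_split(matrix, num_blocks, axis=1)

def solve_block(X, X_block, lam):
    """Assumed subproblem for one column block:
    minimize ||X_block - X @ C_block||_F^2 + lam * ||C_block||_F^2."""
    gram = X.T @ X
    return np.linalg.solve(gram + lam * np.eye(X.shape[1]), X.T @ X_block)

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 12))  # 20 features, 12 data points stored as columns
lam = 0.1

# Solve once for the whole code matrix, then block by block.
C_full = solve_block(X, X, lam)
C_blocks = [solve_block(X, Xb, lam) for Xb in column_blocks(X, 3)]
C_joined = np.hstack(C_blocks)

# The regularizers are sums over column blocks, so the subproblems are independent.
fro_sum = sum(np.sum(Cb ** 2) for Cb in C_blocks)
l1_sum = sum(np.abs(Cb).sum() for Cb in C_blocks)

print(np.allclose(C_joined, C_full))
print(np.isclose(fro_sum, np.sum(C_full ** 2)))
print(np.isclose(l1_sum, np.abs(C_full).sum()))
```

Under these assumptions, each block can be solved on a separate machine without exchanging code-matrix entries, since both the fitting term and the regularizer separate over columns of C. Here the l1-norm is taken entrywise.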