In the subspace clustering with missing data (SCMD) problem, we are given a collection of n partially observed d-dimensional vectors. The data points are assumed to be concentrated near a union of low-dimensional subspaces. The goal of SCMD is to cluster the vectors according to their subspace membership and recover the underlying basis, which can then be used to infer their missing entries. State-of-the-art algorithms for SCMD can fail on instances with a high proportion of missing data, with full-rank data, or if the underlying subspaces are similar to each other. We propose a novel integer programming approach for SCMD. The approach is based on dynamically determining a set of candidate subspaces and optimally assigning points to selected subspaces. The problem structure is identical to the classical facility-location problem, with subspaces playing the role of facilities and data points playing that of customers. We propose a column-generation approach for identifying candidate subspaces combined with a Benders decomposition approach for solving the linear programming relaxation of the formulation. An empirical study demonstrates that the proposed approach can achieve better clustering accuracy than state-of-the-art methods when the data are high rank, the percentage of missing data is high, or the subspaces are similar.

Funding: Support for this research was provided by American Family Insurance through a research partnership with the University of Wisconsin–Madison’s Data Science Institute.

Supplemental Material: The online appendix is available at https://doi.org/10.1287/ijoo.2023.0020.
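To illustrate the facility-location analogy in the abstract, the following is a minimal sketch of the assignment side of the problem only: given a fixed list of candidate subspaces, it computes a cost for serving each partially observed point from each subspace and assigns every point to its cheapest candidate. The function names `residual_cost` and `assign_points`, the squared-residual cost on observed entries, and the toy data are assumptions made for illustration; they are not taken from the paper, and the sketch does not perform subspace selection, column generation, or Benders decomposition.

```python
import numpy as np

def residual_cost(x, mask, U):
    """Squared least-squares residual of the observed entries of x
    with respect to the subspace spanned by the columns of U.
    (Assumed cost; the paper's cost definition may differ.)"""
    U_obs = U[mask]   # basis rows at the observed coordinates
    x_obs = x[mask]   # observed entries of the point
    coef, *_ = np.linalg.lstsq(U_obs, x_obs, rcond=None)
    r = x_obs - U_obs @ coef
    return float(r @ r)

def assign_points(X, M, candidates):
    """Build the point-by-subspace cost matrix and assign each point
    (customer) to its cheapest candidate subspace (facility)."""
    costs = np.array([[residual_cost(x, m, U) for U in candidates]
                      for x, m in zip(X, M)])
    return costs, costs.argmin(axis=1)

# Toy instance: two 1-dimensional candidate subspaces in R^3.
candidates = [np.array([[1.0], [0.0], [0.0]]),   # span of e1
              np.array([[0.0], [1.0], [0.0]])]   # span of e2

# Two points; unobserved entries hold a placeholder 0.0 and are masked out.
X = np.array([[2.0, 0.0, 0.0],
              [0.0, 3.0, 0.0]])
M = np.array([[True, True, False],
              [False, True, True]])

costs, labels = assign_points(X, M, candidates)
```

In a full facility-location model, these costs would enter an integer program that also decides which candidate subspaces to select; here the selection is taken as given.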