Abstract

We propose a novel technique for dataset summarization that selects representatives from a large, unlabeled dataset. The approach is based on the concept of self-rank, defined as the minimum number of samples needed to express all dataset samples with an accuracy proportional to the rank-<i>K</i> approximation. Since the exact computation of self-rank requires a computationally expensive combinatorial search, we propose an efficient algorithm that jointly estimates self-rank and selects the most informative samples with complexity linear in the data size. We derive a new upper bound on the approximation ratio (AR), the ratio of the projection error obtained with the selected samples to the best rank-<i>K</i> approximation error. The best previously known AR for self-representative low-rank approximation was presented in ICML 2017 [1] and was later improved to the bound √(1 + <i>K</i>) reported in NeurIPS 2019 [2]. Both of these bounds are attained by brute-force search, which is impractical, and both depend solely on <i>K</i>, the number of selected samples. In contrast, we derive an adaptive AR that takes into account the spectral properties and spikiness measure of the original dataset A ∈ R<sup><i>N</i>×<i>M</i></sup>. In particular, our performance bound is proportional to the condition number κ(<i>A</i>): the derived AR is 1 + (κ(<i>A</i>)<sup>2</sup> − 1)/(<i>N</i> − <i>K</i>), which approaches 1 and is optimal in two extreme spectral-distribution instances. In the worst case, the AR of the proposed method is shown not to exceed 1.25. The algorithm's linear complexity in the size of the original dataset closes a long-standing gap between practical and theoretical methods for finding representatives.
In addition to evaluating the proposed algorithm on a synthetic dataset, we show that it can be utilized in real-world applications such as graph node sampling for optimizing the shortest path criterion, learning a classifier with representative data, and open-set identification.
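As an illustration of the adaptive bound stated above, the quantity 1 + (κ(A)² − 1)/(N − K) can be evaluated numerically for any data matrix. The sketch below uses a synthetic Gaussian matrix (an assumption for illustration, not the authors' benchmark data or implementation):

```python
import numpy as np

# Evaluate the adaptive approximation-ratio (AR) bound from the abstract:
#   AR <= 1 + (kappa(A)^2 - 1) / (N - K)
# on a synthetic dataset A of N samples in M dimensions.
rng = np.random.default_rng(0)
N, M, K = 100, 50, 10           # illustrative sizes; K = number of representatives
A = rng.standard_normal((N, M))

# Condition number kappa(A) = sigma_max(A) / sigma_min(A) (2-norm).
kappa = np.linalg.cond(A)

ar_bound = 1.0 + (kappa**2 - 1.0) / (N - K)
print(f"kappa(A) = {kappa:.3f}, AR bound = {ar_bound:.3f}")
```

Note how the bound is data-dependent: a well-conditioned dataset (κ close to 1) or a large gap N − K drives the bound toward 1, matching the abstract's claim for the extreme spectral cases.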
