Abstract

We propose a novel technique for dataset summarization that selects representatives from a large, unlabeled dataset. The approach is based on the concept of self-rank, defined as the minimum number of samples needed to express all dataset samples with an accuracy proportional to the rank-<i>K</i> approximation. Since the exact computation of self-rank requires a computationally expensive combinatorial search, we propose an efficient algorithm that jointly estimates self-rank and selects the most informative samples with complexity linear in the data size. We derive a new upper bound on the approximation ratio (AR), the ratio of the projection error obtained with the selected samples to the best rank-<i>K</i> approximation error. The best previously known AR for self-representative low-rank approximation was presented in ICML 2017 [1] and was later improved to the bound √(1 + <i>K</i>) reported in NeurIPS 2019 [2]. Both of these bounds are attained by brute-force search, which is impractical, and both depend solely on <i>K</i>, the number of selected samples. In contrast, we derive an adaptive AR that takes into account the spectral properties and spikiness measure of the original dataset A ∈ R<sup><i>N</i>×<i>M</i></sup>. In particular, our performance bound is proportional to the condition number κ(<i>A</i>): the derived AR is 1 + (κ(<i>A</i>)<sup>2</sup> − 1)/(<i>N</i> − <i>K</i>), which approaches 1 and is optimal in two extreme spectral-distribution instances. In the worst case, the AR of the proposed method is shown not to exceed 1.25. The algorithm's linear complexity in the size of the original dataset closes a long-standing gap between practical and theoretical methods for finding representatives.
In addition to evaluating the proposed algorithm on a synthetic dataset, we show that it can be utilized in real-world applications such as graph node sampling for optimizing the shortest path criterion, learning a classifier with representative data, and open-set identification.
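As an illustration of the adaptive bound stated above, the quantity 1 + (κ(A)² − 1)/(N − K) can be evaluated numerically for any data matrix. The sketch below uses a synthetic Gaussian matrix (an assumption for illustration, not the authors' benchmark data or implementation):

```python
import numpy as np

# Evaluate the adaptive approximation-ratio (AR) bound from the abstract:
#   AR <= 1 + (kappa(A)^2 - 1) / (N - K)
# on a synthetic dataset A of N samples in M dimensions.
rng = np.random.default_rng(0)
N, M, K = 100, 50, 10           # illustrative sizes; K = number of representatives
A = rng.standard_normal((N, M))

# Condition number kappa(A) = sigma_max(A) / sigma_min(A) (2-norm).
kappa = np.linalg.cond(A)

ar_bound = 1.0 + (kappa**2 - 1.0) / (N - K)
print(f"kappa(A) = {kappa:.3f}, AR bound = {ar_bound:.3f}")
```

Note how the bound is data-dependent: a well-conditioned dataset (κ close to 1) or a large gap N − K drives the bound toward 1, matching the abstract's claim for the extreme spectral cases.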
