Diversity Subsampling: Custom Subsamples from Large Data Sets

Boyang Shang, Sanjay Mehrotra, Daniel W Apley

doi:10.1287/ijds.2022.00017

Open Access

Abstract

Subsampling from a large unlabeled (i.e., no response values are available yet) data set is useful in many supervised learning contexts to provide a global view of the data based on only a fraction of the observations. In this paper, we borrow concepts from the well-known sampling/importance resampling technique, which samples from a specified probability distribution, to develop a diversity subsampling approach that selects a subsample from the original data with no prior knowledge of its underlying probability distribution. The goal is to produce a subsample that is, to the maximum extent possible, independently and uniformly distributed over the support of the distribution from which the data are drawn. We give an asymptotic performance guarantee for the proposed method and provide experimental results showing that it performs well for typical finite-size data. We also compare the proposed method with competing diversity subsampling algorithms and demonstrate numerically that subsamples selected by the proposed method are closer to a uniform sample than subsamples selected by other methods. The proposed diversity subsampling (DS) algorithm is more efficient than known methods: it takes only a few minutes to select tens of thousands of subsample points from a data set of size one million. Our DS algorithm easily generalizes to select subsamples following distributions other than uniform. We provide a Python package (FADS) that implements the proposed method.

History: Kwok-Leung Tsui served as the senior editor for this article.

Funding: This work was supported by the National Science Foundation [Grant CMMI-1436574], Northwestern University, the Advanced Research Projects Agency-Energy, and the U.S. Department of Energy [Award DE-AR0001209].

Data Ethics & Reproducibility Note: No data ethics considerations are foreseen related to this article. The code capsule is available on Code Ocean at https://doi.org/10.24433/CO.8309237.v3 and in the e-Companion to this article (available at https://doi.org/10.1287/ijds.2022.00017).
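
The connection to sampling/importance resampling can be illustrated with a small sketch: when the target distribution is uniform, the importance weight of a point is proportional to the reciprocal of the data density at that point, so points in sparse regions are favored. The code below is only an illustration under stated assumptions and is not the authors' DS algorithm or the FADS package API. It assumes one-dimensional data, swaps in a simple equal-width histogram as the density estimate, and uses Efraimidis-Spirakis keys for weighted sampling without replacement. The function names `histogram_density` and `diversity_subsample` are hypothetical.

```python
import math
import random


def histogram_density(xs, bins=20):
    """Estimate a 1-D density at each point with an equal-width histogram.

    Assumption for illustration only: a histogram stands in for whatever
    density estimator a real implementation would use.
    """
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all points coincide
    counts = [0] * bins
    bin_of = []
    for x in xs:
        b = min(int((x - lo) / width), bins - 1)
        bin_of.append(b)
        counts[b] += 1
    n = len(xs)
    # Every point lies in a bin with count >= 1, so each density is positive.
    return [counts[b] / (n * width) for b in bin_of]


def diversity_subsample(xs, k, bins=20, seed=0):
    """Return k distinct indices, drawn with weights proportional to 1 / density."""
    rng = random.Random(seed)
    dens = histogram_density(xs, bins)
    keys = []
    for i, f in enumerate(dens):
        w = 1.0 / f                 # importance weight for a uniform target
        u = 1.0 - rng.random()      # uniform on (0, 1], avoids log(0)
        # Efraimidis-Spirakis key u**(1/w), stored on the log scale.
        keys.append((math.log(u) / w, i))
    keys.sort(reverse=True)         # the k largest keys form the sample
    return [i for _, i in keys[:k]]


if __name__ == "__main__":
    rng = random.Random(1)
    data = [rng.gauss(0.0, 1.0) for _ in range(5000)]
    chosen = diversity_subsample(data, 200)
    print(len(chosen), "points selected")
```

Because the weights flatten the estimated density, a subsample drawn this way from bell-shaped data should be more spread out than a plain random subsample of the same size. How closely it approaches uniformity depends on the quality of the density estimate.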

Full Text