Abstract

Sampling-based approximation methods have demonstrated their potential in various domains such as machine learning, query processing, and data analysis. Most existing sampling algorithms generate samples at the record level, which makes them impractical to apply to very large datasets on a single machine. Even distributed solutions suffer efficiency issues on terabyte-scale datasets. In this paper, we introduce a scalable sampling approach named CDFRS, which generates samples from very large datasets with a distribution-preserving guarantee. CDFRS is significantly faster than existing sampling algorithms on terabyte-scale datasets. We provide theoretical guarantees and empirical evidence that samples generated by CDFRS preserve the distribution characteristics of the original dataset. Additionally, we propose a sample size determination algorithm, denoted A2. Experimental results indicate that the running time of CDFRS is at least an order of magnitude lower than that of other distributed sampling methods. Notably, sampling a 10TB dataset with CDFRS takes only hundreds of seconds, whereas the compared method requires more than ten thousand seconds. In big data analysis tasks such as classification and clustering, models trained with samples generated by CDFRS closely match those trained with the entire training set. Furthermore, the proposed A2 algorithm determines an appropriate sample size more efficiently than traditional methods.
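To make the distribution-preserving claim concrete, the sketch below is a minimal, hypothetical illustration (not the CDFRS implementation): it uses a synthetic dataset and plain uniform sampling, and measures the Kolmogorov-Smirnov distance between the empirical CDF of the sample and that of the full data, which is one common way to quantify how well a sample preserves the original distribution.

```python
# Illustrative diagnostic only, assuming NumPy and a synthetic dataset;
# this is NOT the CDFRS algorithm. It computes the two-sample
# Kolmogorov-Smirnov statistic between a uniform random sample and the
# full data to quantify distribution preservation.
import numpy as np


def ks_statistic(full: np.ndarray, sample: np.ndarray) -> float:
    """Maximum gap between the empirical CDFs of `full` and `sample`."""
    grid = np.sort(np.concatenate([full, sample]))
    cdf_full = np.searchsorted(np.sort(full), grid, side="right") / full.size
    cdf_sample = np.searchsorted(np.sort(sample), grid, side="right") / sample.size
    return float(np.max(np.abs(cdf_full - cdf_sample)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in for one numeric column of a large dataset.
    data = rng.lognormal(mean=0.0, sigma=1.0, size=1_000_000)
    # Plain uniform record-level sampling, used here only as a baseline.
    sample = rng.choice(data, size=10_000, replace=False)
    print(f"KS distance between sample and full data: {ks_statistic(data, sample):.4f}")
```

A small KS distance indicates that the sample's empirical CDF tracks the original closely; a distribution-preserving sampler should keep this distance small even as the sampling fraction shrinks.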
