Abstract

Big data processing and analysis typically run on shared-nothing computer clusters, where data partitioning and sampling are key techniques for improving the speed and scalability of large-scale computations. In this study, we provide a thorough review of data partitioning and sampling approaches applicable to big data processing and analysis. We first cover the fundamentals of data partitioning, including the differences among range partitioning, hash partitioning, and random partitioning. We then turn to standard data sampling techniques, such as simple random sampling, stratified sampling, and reservoir sampling, along with variants suited to cluster environments. Finally, we propose considering data partitioning and sampling jointly when processing large data sets in parallel environments.
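As a concrete illustration of the techniques the abstract names (this is not code from the paper itself; the function names and toy dataset are our own), below is a minimal Python sketch of hash partitioning, range partitioning, and reservoir sampling, ending with the combined use the abstract proposes: partition first, then sample within each partition.

import hashlib
import random

def hash_partition(records, key, num_partitions):
    # Hash partitioning: hash each record's key and take it modulo the
    # partition count, spreading keys roughly uniformly across partitions.
    parts = [[] for _ in range(num_partitions)]
    for rec in records:
        h = int(hashlib.md5(str(rec[key]).encode()).hexdigest(), 16)
        parts[h % num_partitions].append(rec)
    return parts

def range_partition(records, key, boundaries):
    # Range partitioning: assign each record to the bucket whose key range
    # contains it (boundaries sorted ascending); key order is preserved
    # across partitions, which helps range queries and sorting.
    parts = [[] for _ in range(len(boundaries) + 1)]
    for rec in records:
        idx = sum(rec[key] >= b for b in boundaries)
        parts[idx].append(rec)
    return parts

def reservoir_sample(stream, k, rng=random):
    # Reservoir sampling (Algorithm R): a single pass over a stream of
    # unknown length, keeping a uniform random sample of up to k items.
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)  # item replaces a slot with prob. k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

# Joint use: partition the data set, then sample each partition; in a
# cluster, each partition would be sampled in parallel on its own node.
data = [{"id": i, "value": random.random()} for i in range(10_000)]
partitions = hash_partition(data, "id", num_partitions=8)
samples = [reservoir_sample(part, k=100) for part in partitions]

Random partitioning, the third scheme mentioned, simply shuffles records to partitions irrespective of their keys; and when partitions coincide with strata of interest, per-partition sampling as above behaves like stratified sampling.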
