Abstract

Big data processing and analysis typically run on shared-nothing computer clusters, where data partitioning and sampling are key techniques for improving the speed and scalability of large-scale computations. In this study, we provide a thorough review of data partitioning and sampling approaches applicable to big data processing and analysis. We first cover the fundamentals of data partitioning, including the differences among range partitioning, hash partitioning, and random partitioning. We then turn to standard data sampling techniques, such as simple random sampling, stratified sampling, and reservoir sampling, along with variants suited to cluster environments. Finally, we propose considering data partitioning and sampling jointly when processing large data sets in parallel environments.
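As a concrete illustration of the techniques the abstract names (this is not code from the paper itself; the function names and toy dataset are our own), below is a minimal Python sketch of hash partitioning, range partitioning, and reservoir sampling, ending with the combined use the abstract proposes: partition first, then sample within each partition.

import hashlib
import random

def hash_partition(records, key, num_partitions):
    # Hash partitioning: hash each record's key and take it modulo the
    # partition count, spreading keys roughly uniformly across partitions.
    parts = [[] for _ in range(num_partitions)]
    for rec in records:
        h = int(hashlib.md5(str(rec[key]).encode()).hexdigest(), 16)
        parts[h % num_partitions].append(rec)
    return parts

def range_partition(records, key, boundaries):
    # Range partitioning: assign each record to the bucket whose key range
    # contains it (boundaries sorted ascending); key order is preserved
    # across partitions, which helps range queries and sorting.
    parts = [[] for _ in range(len(boundaries) + 1)]
    for rec in records:
        idx = sum(rec[key] >= b for b in boundaries)
        parts[idx].append(rec)
    return parts

def reservoir_sample(stream, k, rng=random):
    # Reservoir sampling (Algorithm R): a single pass over a stream of
    # unknown length, keeping a uniform random sample of up to k items.
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)  # item replaces a slot with prob. k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

# Joint use: partition the data set, then sample each partition; in a
# cluster, each partition would be sampled in parallel on its own node.
data = [{"id": i, "value": random.random()} for i in range(10_000)]
partitions = hash_partition(data, "id", num_partitions=8)
samples = [reservoir_sample(part, k=100) for part in partitions]

Random partitioning, the third scheme mentioned, simply shuffles records to partitions irrespective of their keys; and when partitions coincide with strata of interest, per-partition sampling as above behaves like stratified sampling.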
