A survey of data partitioning and sampling methods to support big data analysis

Mohammad Sultan Mahmud, Joshua Zhexue Huang, Kuanishbay Sadatdiynov, Salman Salloum, Tamer Z Emara

doi:10.26599/bdma.2019.9020015

Open Access

Journal: Big Data Mining and Analytics
Publication Date: Jun 1, 2020
Citations: 175
License type: cc-by

Affiliation: Shenzhen University

Abstract

Computer clusters with the shared-nothing architecture are the major computing platforms for big data processing and analysis. In cluster computing, data partitioning and sampling are two fundamental strategies to speed up the computation of big data and increase scalability. In this paper, we present a comprehensive survey of the methods and techniques of data partitioning and sampling with respect to big data processing and analysis. We start with an overview of the mainstream big data frameworks on Hadoop clusters. The basic methods of data partitioning are then discussed including three classical horizontal partitioning schemes: range, hash, and random partitioning. Data partitioning on Hadoop clusters is also discussed with a summary of new strategies for big data partitioning, including the new Random Sample Partition (RSP) distributed model. The classical methods of data sampling are then investigated, including simple random sampling, stratified sampling, and reservoir sampling. Two common methods of big data sampling on computing clusters are also discussed: record-level sampling and block-level sampling. Record-level sampling is not as efficient as block-level sampling on big distributed data. On the other hand, block-level sampling on data blocks generated with the classical data partitioning methods does not necessarily produce good representative samples for approximate computing of big data. In this survey, we also summarize the prevailing strategies and related work on sampling-based approximation on Hadoop clusters. We believe that data partitioning and sampling should be considered together to build approximate cluster computing frameworks that are reliable in both the computational and statistical respects.
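
The contrast the abstract draws between classical partitioning and block-level sampling can be illustrated with a small sketch. The Python below is not taken from the paper: the function names (`range_partition`, `hash_partition`, `random_partition`) and the toy dataset of integer keys are assumptions made purely for illustration. It splits the same records three ways and then treats a single block as a block-level sample.

```python
import random
from statistics import mean


def range_partition(records, num_blocks):
    """Range partitioning: sort by key, cut into at most num_blocks contiguous ranges."""
    ordered = sorted(records)
    size = -(-len(ordered) // num_blocks)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def hash_partition(records, num_blocks):
    """Hash partitioning: a record's block is hash(key) modulo the block count."""
    blocks = [[] for _ in range(num_blocks)]
    for record in records:
        # hash() is stable for ints; for str keys Python salts it per process.
        blocks[hash(record) % num_blocks].append(record)
    return blocks


def random_partition(records, num_blocks, seed=0):
    """Random partitioning: shuffle, then deal records out round-robin."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::num_blocks] for i in range(num_blocks)]


# Toy data (an assumption): the integers 0..9999 stand in for record keys.
records = list(range(10_000))
global_mean = mean(records)

# Block-level sampling: take one whole block as the sample.
range_sample = range_partition(records, 10)[0]
random_sample = random_partition(records, 10)[0]

print("global mean:      ", global_mean)
print("range block mean: ", mean(range_sample))
print("random block mean:", mean(random_sample))
```

In this toy setting the range-partitioned block holds only the smallest keys, so its mean sits far from the global mean, while a block from the random partition tends to land close to it. Whether a hash-partitioned block is representative depends on how the key relates to the quantity being estimated.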

Highlights

An overwhelming volume of data is being generated from business transactions, computer ...
The survey presented in this paper gives a concise summary of the most common methods of partitioning and sampling to support big data analysis on Hadoop clusters.
... sampling-based approximate big data analysis, we present a concise overview of these methods with respect to big data on Hadoop clusters.

Big Data Mining and Analytics, June 2020, 3(2): 85–101. © The author(s) 2020.

Summary

Introduction

An overwhelming volume of data is being generated from business transactions, computer ... The MapReduce computing model[5] is used to apply this strategy in the mainstream big data analysis frameworks[6,7,8,9], such as Apache Hadoop (http://hadoop.apache.org/) and Apache Spark (http://spark.apache.org/). These frameworks implement a shared-nothing architecture (https://www.oreilly.com/learning/processing-data-inhadoop) in which each node is independent in terms of both data and resources. Studies have shown that when the data size is large enough, parallelization based on distributed data blocks can yield a linear speed-up as the computing resources in the cluster increase[11]. However, scaling out a computing cluster incurs additional costs, and the necessary investment may not always be available in practice[12].
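
As a rough illustration of how a MapReduce-style computation works over independent data blocks, the sketch below computes a mean from per-block partial results. It is a minimal sketch under stated assumptions, not the Hadoop or Spark API: the names `map_block`, `combine`, and `distributed_mean` are hypothetical, and a thread pool merely stands in for cluster nodes to show the structure (it does not demonstrate real speed-up).

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_block(block):
    """Map step: summarize one block using only that block's data."""
    return (sum(block), len(block))


def combine(a, b):
    """Reduce step: merge two partial (sum, count) results."""
    return (a[0] + b[0], a[1] + b[1])


def distributed_mean(blocks, workers=4):
    """Run the map step per block, then reduce the partial results."""
    # Threads are a stand-in for nodes (an assumption for illustration).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_block, blocks))
    total, count = reduce(combine, partials, (0, 0))
    return total / count


# Three hypothetical data blocks, as if stored on three different nodes.
blocks = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
print(distributed_mean(blocks))  # 5.0
```

Because each map call reads only its own block and the reduce step is associative, the blocks can be processed in any order and on any node, which is the property a shared-nothing cluster relies on.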
