A distributed data management system to support large-scale data analysis

Tamer Z Emara,Joshua Zhexue Huang

doi:10.1016/j.jss.2018.11.007

Abstract

Distributed data management is a key technology to enable efficient massive data processing and analysis in cluster-computing environments. Specifically, in environments where the data volumes are beyond the system capabilities, big data files are required to be summarized by representative samples with the same statistical properties as the whole dataset. This paper proposes a big data management system (BDMS) based on distributed random sample data blocks. It presents a high-level architecture design of the BDMS which extends the current distributed file systems. This system offers certain functionalities for block-level management such as statistically-aware data partitioning, data blocks organization, and data blocks selection. This paper also presents a round-random partitioning scheme to represent a big dataset as a set of non-overlapping data blocks; each block is a random sample of the whole dataset. Based on the presented scheme, two algorithms are introduced as an implementation strategy to convert the HDFS blocks of a big file into a set of random sample data blocks which is also stored in HDFS. The experimental results show that the execution time of partitioning operation is acceptable in the real applications because this operation is only performed once on each input data file.

Full Text