Abstract

Understanding the scalability of MapReduce applications is a challenging problem. The difficulty lies in the distributed mapping of large input datasets: the placement of data and compute resources must match fluctuating network substrates, and user-defined Map and Reduce functions with application-specific parameters further complicate the issue. It is therefore highly valuable to use small datasets and limited test runs to predict the behavior of MapReduce applications on big data. In this paper, we analyze how server cluster size affects the scaling of a variety of Map-intensive and Reduce-intensive applications. We identify the specific conditions under which representative MapReduce applications conform to power-law scaling. We report four major findings: (1) Within a range of scaling parameters, MapReduce execution time follows a power law. (2) Map-intensive applications exhibit power-law scalability even on small clusters. (3) Shuffle-intensive applications exhibit power-law behavior only beyond a larger cluster size. (4) The scaling behavior may depart from the power law when cloud resources are heavily overprovisioned relative to workload demands. These findings enable users to rely on bounded test runs to allocate and configure virtual and physical resources for large-scale MapReduce applications. The results can also be applied to building business models for cost-effective cloud computing services.
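To illustrate how bounded test runs could be used in the way the abstract describes, the following minimal sketch fits a power law T(n) = a * n^b to execution times measured on small clusters and extrapolates to a larger cluster size. The measurements and cluster sizes are hypothetical placeholders, not data from the paper, and the fit is only meaningful inside the power-law scaling range the study identifies.

```python
import numpy as np

# Hypothetical small-scale test measurements: cluster sizes (nodes) and
# observed MapReduce job execution times in seconds (illustrative values only).
nodes = np.array([4, 8, 16, 32])
exec_time = np.array([1950.0, 1010.0, 530.0, 280.0])

# Within the power-law regime, T(n) = a * n^b, so log T is linear in log n.
b, log_a = np.polyfit(np.log(nodes), np.log(exec_time), 1)
a = np.exp(log_a)

# Extrapolate to a larger cluster; valid only while the workload remains
# inside the power-law scaling range (and resources are not overprovisioned).
target_nodes = 128
predicted = a * target_nodes ** b
print(f"fitted exponent b = {b:.2f}, "
      f"predicted time on {target_nodes} nodes = {predicted:.0f} s")
```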
