Abstract

Understanding performance scalability in MapReduce applications presents a challenging problem. The difficulty lies in the distributed locations of input data and the distributed compute resources that utilize varied network substrates. User-defined Map and Reduce stages, with numerous application parameters, further complicate the problem. Using small datasets and limited test runs to understand how MapReduce applications will behave with "big data" can have a significant payoff. In this paper, we evaluate the impact of cluster-size scaling on execution time for a set of Map- and Reduce-intensive applications. We model the MapReduce framework, specify conditions and implications of power-law conformity, and verify our model with data from benchmark MapReduce applications. Empirical results indicate that: (1) within a range of scaling parameters, MapReduce execution times follow a power-law distribution; (2) power-law scalability for Map-intensive applications begins at small cluster sizes; (3) shuffle-intensive applications exhibit power-law behavior only from larger cluster sizes; (4) cluster-scaling performance gains cease to follow a power law when computing resources far exceed those needed. Our findings will facilitate using small-scale test runs to allocate and configure virtual and physical computing resources in large-scale clouds.
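The power-law relationship the abstract describes can be checked in practice with a log-log regression: if execution time follows T(n) = a * n^b for cluster size n, then log T is linear in log n. The sketch below is a minimal illustration of that fitting procedure; the cluster sizes and execution times are hypothetical placeholder measurements, not data from the paper.

```python
import numpy as np

# Hypothetical measurements (illustrative only): cluster sizes and
# observed MapReduce job execution times in seconds.
cluster_sizes = np.array([4, 8, 16, 32, 64], dtype=float)
exec_times = np.array([1000.0, 520.0, 270.0, 140.0, 75.0])

# Fit T(n) = a * n^b by linear regression in log-log space:
#   log T = log a + b * log n
b, log_a = np.polyfit(np.log(cluster_sizes), np.log(exec_times), 1)
a = np.exp(log_a)

# Goodness of fit: near-1 R^2 in log-log space suggests power-law conformity
# over this range of cluster sizes.
predicted = a * cluster_sizes ** b
residual = np.sum((exec_times - predicted) ** 2)
total = np.sum((exec_times - np.mean(exec_times)) ** 2)
r2 = 1.0 - residual / total

print(f"T(n) ~ {a:.1f} * n^{b:.3f}, R^2 = {r2:.4f}")
```

With data that truly scales as a power law, b is close to -1 (near-linear speedup) and R^2 stays near 1; deviations at very small or very large cluster sizes would signal the regimes where, as the abstract notes, power-law behavior breaks down.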
