Abstract

As a typical representative of distributed computing frameworks, Spark has been continuously developed and popularized. It reduces data transmission time through efficient memory-based operations and overcomes the shortcomings of the traditional MapReduce computation model in iterative computation. In Spark, data skew is prominent due to the uneven distribution of input data and the unbalanced allocation of the default partitioning algorithm. When data skew occurs, the execution efficiency of the program is reduced, especially in the reduce stage of Spark. Therefore, this paper proposes ReducePartition to solve the data skew problem at the reduce stage of the Spark platform. First, each compute node samples its local data according to a sampling algorithm to predict the overall characteristics of the data distribution. Then, to make full use of cluster resources, ReducePartition divides the data evenly into multiple partitions. Next, taking into account the differences in computational capability among Executors, each task is assigned to the Executor with the highest performance factor according to a greedy strategy. Finally, ReducePartition is compared with related algorithms using the WordCount and Sort benchmarks on a heterogeneous Spark standalone cluster, and its performance is analyzed under different degrees of data skew and different data sizes. Experimental results show that the proposed algorithm can effectively reduce the impact of data skew on the total makespan of Spark big data applications: the average total makespan is reduced by 30%-50%, while resource utilization is increased by 20%-30% on average.
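The abstract does not include the algorithm itself, but the greedy placement step it describes (assigning the heaviest partitions first, each to the Executor that currently offers the best effective capacity given its performance factor) can be illustrated with a short sketch. The Scala code below is a minimal, hypothetical rendering of that LPT-style heuristic; all names (Task, Executor, performanceFactor, the load estimates) are illustrative assumptions, not the paper's actual implementation.

```scala
// Hypothetical sketch of ReducePartition's greedy task placement step.
// Partition loads are assumed to come from the sampling phase described
// in the abstract; performance factors model heterogeneous Executors.
import scala.collection.mutable

final case class Task(id: Int, load: Long)                    // estimated records in this partition
final case class Executor(id: Int, performanceFactor: Double) // relative compute capability

object GreedyPlacement {
  /** Assign each task to the executor whose accumulated load, normalized
    * by its performance factor, is currently smallest (greedy / LPT). */
  def assign(tasks: Seq[Task], executors: Seq[Executor]): Map[Int, Seq[Task]] = {
    // Min-heap on normalized load (load / performanceFactor).
    val heap = mutable.PriorityQueue.empty[(Double, Executor, mutable.Buffer[Task])](
      Ordering.by[(Double, Executor, mutable.Buffer[Task]), Double](_._1).reverse)
    executors.foreach(e => heap.enqueue((0.0, e, mutable.Buffer.empty[Task])))

    // Heaviest partitions first, so they land on fast, lightly loaded executors.
    for (task <- tasks.sortBy(-_.load)) {
      val (normLoad, exec, assigned) = heap.dequeue()
      assigned += task
      heap.enqueue((normLoad + task.load / exec.performanceFactor, exec, assigned))
    }
    heap.toSeq.map { case (_, e, ts) => e.id -> ts.toSeq }.toMap
  }

  def main(args: Array[String]): Unit = {
    val tasks     = Seq(Task(0, 900L), Task(1, 400L), Task(2, 350L), Task(3, 100L))
    val executors = Seq(Executor(0, 2.0), Executor(1, 1.0)) // executor 0 is twice as fast
    assign(tasks, executors).toSeq.sortBy(_._1).foreach { case (e, ts) =>
      println(s"executor $e <- tasks ${ts.map(_.id).mkString(", ")}")
    }
  }
}
```

Under these assumptions, the skewed 900-record partition goes to the faster executor while the remaining partitions fill in around it, which is the balancing behavior the abstract attributes to the greedy strategy.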
