With the rise of cloud computing, the Internet of Things, social networks, and other technologies, data volumes are expanding rapidly, and traditional processing and storage systems struggle to handle such massive data. Spark is a fast and efficient MapReduce implementation developed after Hadoop. However, Spark's shuffle operation can leave the data sets on some worker nodes too large while other nodes sit idle, degrading job performance; this phenomenon is called data skew. To address data skew on the Spark platform, this paper proposes SP-LRP (Spark load balancing mechanism based on Linear Regression Partition), a load balancing mechanism based on linear regression partition prediction. SP-LRP predicts the partition sizes of reduce tasks at run time, applies a skew detection algorithm to identify skewed partitions, and adjusts task resource allocation according to a fine-grained resource allocation algorithm. We evaluate SP-LRP on benchmark datasets, comparing the average execution time of the two algorithms and the degree of load balance among reducers under different conditions. The experiments confirm the efficiency of SP-LRP in its respective usage scenarios.
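To make the idea concrete, the following is a minimal sketch of the prediction-and-detection step described above. The function names, the least-squares fit, the extrapolation from a sampled fraction of map output, and the skew threshold (`factor` times the mean predicted size) are all illustrative assumptions, not the paper's actual SP-LRP implementation:

```python
def fit_linear_regression(xs, ys):
    """Ordinary least squares for y = a*x + b (single feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b

def detect_skewed_partitions(sampled_sizes, sample_fraction, factor=1.5):
    """Predict full reduce-partition sizes from sampled map output,
    then flag partitions whose predicted size exceeds factor * mean.

    sampled_sizes: bytes per reduce partition seen in the sampled map tasks.
    sample_fraction: fraction of map tasks that were sampled.
    """
    # Train on (sampled size, extrapolated size) pairs. In this toy sketch
    # the targets are constructed by simple extrapolation; at run time they
    # would come from the sizes actually observed for completed tasks.
    xs = sampled_sizes
    ys = [s / sample_fraction for s in sampled_sizes]
    a, b = fit_linear_regression(xs, ys)

    predicted = [a * x + b for x in xs]
    mean_size = sum(predicted) / len(predicted)
    return [i for i, p in enumerate(predicted) if p > factor * mean_size]

# Example: partition 2 holds far more sampled data than the others,
# so it is reported as skewed.
sizes = [100, 120, 900, 110, 95]
print(detect_skewed_partitions(sizes, sample_fraction=0.1))  # -> [2]
```

A scheduler could then give the partitions returned here more resources (or split them), which is the role the fine-grained resource allocation algorithm plays in SP-LRP.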