Robustness Comparison of Scheduling Algorithms in MapReduce Framework

Amirali Daghighi, Jim Q Chen

doi:10.1007/978-3-030-80119-9_30

Abstract

Parallel computing is the foundation of the MapReduce framework in Hadoop. A large data set is split into small data chunks, and a Map task refers to the processing of one data chunk. Hadoop replicates each data chunk over three servers to increase data availability and decrease the probability of data loss. As a result, the three servers that store the data chunk of a Map task on their disks are the fastest servers to process it; they are called local servers. All other servers in the same rack as a local server are called rack-local servers. They are slower than local servers at processing the Map task, since the associated data chunk must be fetched through the top-of-the-rack switch. All remaining servers are called remote servers. They are the slowest at processing the Map task, since they need to fetch the data from a local server in another rack, so the data must be transmitted through at least two top-of-the-rack switches and a core switch. Note that the number of switches on the data transfer path depends on the internal network structure of the data center. The First-In-First-Out (FIFO) and Hadoop Fair Scheduler (HFS) algorithms do not take the rack structure of data centers into account, and they are known to be neither heavy-traffic delay optimal nor even throughput optimal. Recent advances in data center scheduling that consider both the rack structure and the heterogeneity of servers resulted in the state-of-the-art Balanced-PANDAS algorithm, which outperforms the classic MaxWeight algorithm and its derivative, the JSQ-MaxWeight algorithm. Both Balanced-PANDAS and MaxWeight assume that the processing rates of local, rack-local, and remote servers are known. However, because traffic changes over time and processing rates are subject to estimation errors, it is not realistic to assume that the processing rates are known.
In this work, we study the robustness of the Balanced-PANDAS and MaxWeight algorithms to inaccurate estimates of processing rates. We observe that Balanced-PANDAS is less sensitive than MaxWeight to the accuracy of the processing-rate estimates, making it more appealing for use in data centers.

Keywords: Hadoop, MapReduce, Data center, Scheduling, Load balancing, Robustness
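The three locality levels described in the abstract can be illustrated with a minimal sketch. It is not taken from the paper: the function `classify`, the server names, and the rack layout are hypothetical, and the only assumption is that each server belongs to exactly one rack and each data chunk is stored on a known set of replica servers.

```python
from enum import Enum


class Locality(Enum):
    LOCAL = "local"            # server stores a replica of the chunk
    RACK_LOCAL = "rack-local"  # server shares a rack with a replica holder
    REMOTE = "remote"          # server is in a rack with no replica


def classify(server, replica_servers, rack_of):
    """Return the locality of `server` for a Map task whose data chunk
    is stored on `replica_servers`. `rack_of` maps server -> rack id.
    (Hypothetical helper, for illustration only.)"""
    if server in replica_servers:
        return Locality.LOCAL
    replica_racks = {rack_of[s] for s in replica_servers}
    if rack_of[server] in replica_racks:
        return Locality.RACK_LOCAL
    return Locality.REMOTE


# Hypothetical cluster: three racks with two servers each.
rack_of = {"s1": "r1", "s2": "r1", "s3": "r2", "s4": "r2", "s5": "r3", "s6": "r3"}
# The chunk is replicated on three servers, as in Hadoop's default setup.
replicas = {"s1", "s3", "s4"}

levels = {s: classify(s, replicas, rack_of) for s in rack_of}
for server, level in sorted(levels.items()):
    print(server, level.value)
```

In this toy layout, `s2` is rack-local because it shares rack `r1` with the replica holder `s1`, while `s5` and `s6` are remote because rack `r3` holds no replica. A scheduler that knows these levels still needs the processing rate of each level, which is the quantity whose estimation error the paper studies.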

Full Text