Abstract

Varying data transmission times, hard-to-predict processing times, and node-dependent access times make MapReduce task scheduling complex. In this article, we consider the problem of scheduling MapReduce tasks on heterogeneous, geo-distributed data centers so as to minimize total tardiness. A new architecture is constructed to analyze data in this scenario. We mathematically model the distinct data transmission levels (inter- and intra-data-center) and the heterogeneity of nodes. An algorithm framework is proposed to schedule MapReduce tasks on heterogeneous nodes in geographically distributed data centers; it is suitable for both Hadoop MRv1 and MRv2. At each heartbeat, as many tasks are selected from a sorted job sequence as there are idle containers detected. For the map and reduce phases, two measurements are developed, based on data locality and completion time, respectively. Using these measurements, the classical Hungarian algorithm is adopted to optimally assign the selected tasks to the idle containers. Components and parameters of the proposal are statistically calibrated over a large set of random instances. The proposed algorithm is compared with existing methods for similar problems, and experimental results demonstrate that it is effective for the considered problem.
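
The assignment step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `assign_tasks` and the example cost values are hypothetical, and the cost matrix merely stands in for the paper's data-locality (map phase) or completion-time (reduce phase) measurement, whose exact definitions are not given in the abstract. The sketch assumes the number of selected tasks does not exceed the number of idle containers, and uses the standard O(n^3) potentials formulation of the Hungarian algorithm.

```python
def assign_tasks(cost):
    """Minimum-cost assignment of tasks (rows) to idle containers (columns).

    cost[i][j] is an assumed, illustrative cost of running task i in
    container j. Requires len(cost) <= len(cost[0]).
    Returns a list where entry i is the container index chosen for task i.
    """
    n, m = len(cost), len(cost[0])
    INF = float("inf")
    u = [0] * (n + 1)    # row potentials (1-indexed)
    v = [0] * (m + 1)    # column potentials (1-indexed)
    p = [0] * (m + 1)    # p[j]: row currently matched to column j (0 = none)
    way = [0] * (m + 1)  # way[j]: previous column on the augmenting path

    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [INF] * (m + 1)
        used = [False] * (m + 1)
        # Grow an alternating tree until a free column is reached.
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], INF, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            # Shift potentials so the next tightest edge becomes usable.
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        # Flip the matching along the augmenting path.
        while True:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
            if j0 == 0:
                break

    assignment = [-1] * n
    for j in range(1, m + 1):
        if p[j] != 0:
            assignment[p[j] - 1] = j - 1
    return assignment


if __name__ == "__main__":
    # Hypothetical costs: 3 selected tasks x 3 idle containers.
    cost = [
        [4, 1, 3],
        [2, 0, 5],
        [3, 2, 2],
    ]
    plan = assign_tasks(cost)
    total = sum(cost[t][c] for t, c in enumerate(plan))
    print(plan, total)
```

In a heartbeat-driven scheduler, such a matrix would be rebuilt at each heartbeat from the tasks just selected and the containers just reported idle. A per-phase cost definition, as the abstract describes, would change only how the matrix entries are computed, not the assignment routine.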
