Abstract

Network bandwidth is a scarce resource in big data environments, so data locality is a fundamental problem for data-parallel frameworks such as Hadoop and Spark. The problem is exacerbated in clusters of multicore servers, where multiple tasks running on the same server compete for that server's network bandwidth. Existing approaches address the problem by scheduling computational tasks near their input data, taking into account each server's free time, the placement of data, and data transfer costs. However, such approaches usually assign identical values to data transfer costs, even though a multicore server's data transfer cost grows with the number of data-remote tasks it runs; as a result, they minimize data-processing time ineffectively. As a solution, we propose DynDL (Dynamic Data Locality), a novel data-locality-aware task-scheduling model that handles dynamic data transfer costs for multicore servers. DynDL offers greater flexibility than existing approaches by using a set of non-decreasing functions to evaluate dynamic data transfer costs. We also propose online and offline algorithms (based on DynDL) that minimize data-processing time and adaptively adjust data locality. Although the general DynDL problem is NP-complete (nondeterministic polynomial-complete), we prove that the offline algorithm runs in quadratic time and produces optimal results for specific instances of DynDL. Using a series of simulations and real-world executions, we show that our algorithms reduce data-processing time by 30% compared with algorithms that do not consider dynamic data transfer costs. Moreover, our algorithms adaptively adjust data locality based on each server's free time, data placement, and network bandwidth, and can schedule tens of thousands of tasks within sub-second to second timescales.
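To make the model concrete, the following minimal Python sketch illustrates the dynamic-cost idea described above: each server's remote-transfer cost is a non-decreasing function of the number of data-remote tasks it already hosts, and a greedy online rule places each task on the server with the smallest estimated finish time. All names (`Server`, `remote_cost`, `schedule_online`), the linear cost function, and the uniform `compute_cost` are illustrative assumptions, not the paper's actual algorithm or API.

```python
# Illustrative sketch of the DynDL idea (not the paper's algorithm):
# a data-remote task's transfer cost grows with the number of remote
# tasks already contending for the server's network bandwidth.
from dataclasses import dataclass

@dataclass
class Server:
    name: str
    free_time: float = 0.0          # time at which the server becomes idle
    remote_tasks: int = 0           # data-remote tasks already assigned here
    base_remote_cost: float = 2.0   # transfer cost of a lone remote task (assumed)

    def remote_cost(self, extra: int = 1) -> float:
        """Non-decreasing cost function: each additional data-remote task
        contends for the NIC, so per-task transfer cost grows. A linear
        function is assumed here; any non-decreasing function fits the model."""
        return self.base_remote_cost * (self.remote_tasks + extra)

def schedule_online(task_input_location: str, servers: list[Server],
                    compute_cost: float = 1.0) -> Server:
    """Greedy online rule: pick the server with the smallest estimated
    completion time. Data-local placements pay no transfer cost;
    data-remote placements pay the dynamic transfer cost."""
    def finish_time(s: Server) -> float:
        transfer = 0.0 if s.name == task_input_location else s.remote_cost()
        return s.free_time + transfer + compute_cost

    best = min(servers, key=finish_time)
    best_finish = finish_time(best)
    if best.name != task_input_location:
        best.remote_tasks += 1      # this placement sacrifices data locality
    best.free_time = best_finish
    return best

servers = [Server("s1"), Server("s2")]
for loc in ["s1", "s1", "s1", "s1", "s2"]:  # servers holding each task's input
    chosen = schedule_online(loc, servers)
    print(f"task with data on {loc} -> {chosen.name}")
```

Running the loop shows the adaptive behavior claimed in the abstract: the first three tasks run data-locally on `s1`, but once `s1`'s queue grows long enough, the fourth task is shipped to `s2` because paying the (dynamic) transfer cost beats waiting for `s1` to free up.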

Highlights

  • Data-parallel frameworks such as MapReduce [1], Hadoop [2], Spark [3], Pregel [4], and TensorFlow [5] have emerged as important components in big data-processing ecosystems

  • We propose a novel data locality scheduling model with dynamic data transfer costs for multicore servers, and develop online and offline algorithms for the model

  • This paper studies a fundamental problem for data-parallel frameworks: data-locality-aware task scheduling


Summary

Introduction

Data-parallel frameworks such as MapReduce [1], Hadoop [2], Spark [3], Pregel [4], and TensorFlow [5] have emerged as important components in big data-processing ecosystems. The Spark deployment at Facebook processes tens of petabytes of newly generated data every day, and a single job can process hundreds of terabytes of data [6]. Because data-parallel frameworks process terabytes or petabytes of data on hundreds or thousands of servers, the costs of transferring data between servers significantly affect the frameworks' performance. Data locality thus becomes a fundamental problem for all data-parallel frameworks.

