Abstract

Traditional HPC architectures separate compute nodes from storage nodes, interconnecting them with high-speed links to satisfy data-access requirements in multi-user environments. However, the capacity of those high-speed links is still much lower than the aggregate bandwidth of all compute nodes. In data-parallel systems such as GFS/MapReduce, clusters are built from commodity hardware and each node serves both computation and storage, which makes it possible to bring compute to data. Data locality is a significant advantage of data-parallel systems over traditional HPC systems: good data locality reduces cross-switch network traffic, one of the bottlenecks in data-intensive computing. In this paper, we investigate data locality in depth. First, we build a mathematical model of scheduling in MapReduce and theoretically analyze how configuration factors, such as the numbers of nodes and tasks, affect data locality. Second, we show that Hadoop's default scheduling is non-optimal and propose an algorithm that schedules multiple tasks simultaneously, rather than one by one, to achieve optimal data locality. Third, we run extensive experiments to quantify the performance improvement of our proposed algorithm, measure how different factors impact data locality, and investigate how data locality influences job execution time in both single-cluster and cross-cluster environments.
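The gap between one-by-one scheduling and simultaneous multi-task scheduling can be illustrated with a toy sketch. This is not the paper's implementation; it is a minimal Python example with a hypothetical two-node, two-task workload, where the optimal assignment is found by brute force (for realistic sizes this would be solved as an assignment problem, e.g. with the Hungarian algorithm).

```python
# Toy contrast between Hadoop-style greedy per-task scheduling and a globally
# optimal multi-task assignment. Workload and node names are hypothetical.
from itertools import permutations

# Each task maps to the set of nodes holding a replica of its input split.
# Task B's data lives only on node 0, so scheduling order matters.
task_data = {"A": {0, 1}, "B": {0}}
nodes = [0, 1]  # one free map slot per node


def greedy_locality(task_data, nodes):
    """Schedule tasks one by one; each takes a free node with local data if any."""
    free = set(nodes)
    local = 0
    for task, replicas in task_data.items():
        choices = sorted(replicas & free)
        if choices:                 # a local slot is still free
            free.remove(choices[0])
            local += 1
        else:                       # forced to run remotely
            free.remove(min(free))
    return local


def optimal_locality(task_data, nodes):
    """Consider all tasks at once; maximize the number of node-local tasks."""
    tasks = list(task_data)
    return max(
        sum(node in task_data[t] for t, node in zip(tasks, perm))
        for perm in permutations(nodes, len(tasks))
    )


print(greedy_locality(task_data, nodes))   # greedy places only 1 task locally
print(optimal_locality(task_data, nodes))  # simultaneous assignment places 2
```

Greedy scheduling gives task A node 0 first, leaving task B with no local slot; considering both tasks together yields the assignment A→1, B→0, in which every task is data-local.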
