Abstract

Improving the performance of the MapReduce scheduler is a primary objective, especially in a heterogeneous virtualized cloud environment. A map task is typically assigned an input split, which consists of one or more data blocks. When a map task is assigned more than one data block, non-local execution is performed. In classical MapReduce scheduling schemes, data blocks are copied over the network to the node where the map task is running. This increases job latency and consumes considerable network bandwidth within and between racks in the cloud data centre. To address this, we propose a methodology, “improving data locality using ant colony optimization (IDLACO),” to minimize the number of non-local executions and the virtual network bandwidth consumed when an input split is assigned more than one data block. First, IDLACO determines, for each map task of a MapReduce job, the set of data blocks to be executed non-locally so as to minimize job latency and virtual network consumption. Then, the target virtual machine to execute the map task is determined based on its heterogeneous performance. Finally, if a set of data blocks is transferred to the same node for repeated job execution, the blocks are temporarily cached in the target virtual machine. The performance of IDLACO is analysed and compared with the fair scheduler and the Holistic scheduler in terms of the number of non-local executions, average map task latency, job latency, and the amount of bandwidth consumed for a MapReduce job. Results show that IDLACO significantly outperformed the classical fair scheduler and the Holistic scheduler.

Highlights

  • Collecting big data is becoming more common in academia, industry, and research sectors

  • Hadoop MapReduce is widely used as a service by different cloud service providers

  • When an input split (IS) is assigned a greater number of data blocks, the number of non-local executions (NNLE) increases, leading to high bandwidth consumption and job latency
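
The relationship between input-split size and non-local executions can be illustrated with a minimal sketch. This is not the IDLACO algorithm (which uses ant colony optimization); it only counts which blocks of a split have no replica on a candidate VM and picks the VM needing the fewest transfers. The block names, VM names, replica placement, and the helper functions `non_local_blocks` and `best_vm` are all hypothetical.

```python
from typing import Dict, List, Set


def non_local_blocks(split: List[str], vm: str,
                     placement: Dict[str, Set[str]]) -> List[str]:
    """Return the blocks of `split` that have no replica on `vm`.

    Each returned block would have to be copied over the network,
    i.e. it counts as one non-local execution for this sketch.
    """
    return [b for b in split if vm not in placement.get(b, set())]


def best_vm(split: List[str], vms: List[str],
            placement: Dict[str, Set[str]]) -> str:
    """Pick the VM that needs the fewest block transfers.

    Ties are broken by order in `vms`. VM performance is ignored here,
    which is a simplification relative to the approach in the abstract.
    """
    return min(vms, key=lambda vm: len(non_local_blocks(split, vm, placement)))


# Hypothetical replica placement: block id -> VMs holding a replica.
placement = {
    "b1": {"vm1", "vm2"},
    "b2": {"vm2", "vm3"},
    "b3": {"vm3"},
}
split = ["b1", "b2", "b3"]   # one input split made of three data blocks
vms = ["vm1", "vm2", "vm3"]

for vm in vms:
    print(vm, "must fetch", non_local_blocks(split, vm, placement))
print("fewest transfers:", best_vm(split, vms, placement))
```

With a single-block split, some VM can always run the task fully locally; once the split spans blocks whose replicas sit on different VMs, every candidate VM in this toy placement has to fetch at least one block.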


Summary

Introduction

Collecting big data is becoming more common in academia, industry, and research sectors. (ii) VMs in the virtual cluster could belong to different flavours, such as small, medium, and large, and be hosted on heterogeneous physical machines. (iii) Hardware heterogeneity, VM heterogeneity, and interference from co-located VMs together cause the same map/reduce task to perform differently from one VM to another. This is called performance heterogeneity, and it makes VM performance unpredictable at the infrastructure level. A low-performing VM might receive a greater number of data blocks to process, while a high-performing VM might receive very few. This increases the map task latency, thereby increasing the MapReduce job latency. Even though assigning more data blocks to an input split minimizes the number of map tasks, it can increase the number of non-local executions [3] and local bandwidth consumption at the MapReduce level.
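
The effect of performance heterogeneity on latency can be shown with a small worked example. It is a sketch under stated assumptions, not a model from the paper: each VM processes its blocks one after another, VMs run in parallel, and the map phase ends when the slowest VM finishes. The per-block times and block counts are invented for illustration, and `map_phase_latency` is a hypothetical helper.

```python
from typing import Dict


def map_phase_latency(blocks_per_vm: Dict[str, int],
                      seconds_per_block: Dict[str, float]) -> float:
    """Map-phase latency when every VM works through its blocks sequentially.

    The phase finishes only when the last VM finishes, so the latency
    is the maximum per-VM processing time.
    """
    return max(blocks_per_vm[vm] * seconds_per_block[vm] for vm in blocks_per_vm)


# Assumed (illustrative) per-block processing time on each VM.
seconds_per_block = {"slow_vm": 8.0, "fast_vm": 2.0}

# Same eight blocks, distributed two different ways.
skewed = {"slow_vm": 6, "fast_vm": 2}     # slow VM receives most blocks
balanced = {"slow_vm": 2, "fast_vm": 6}   # fast VM receives most blocks

print("skewed:", map_phase_latency(skewed, seconds_per_block), "s")
print("balanced:", map_phase_latency(balanced, seconds_per_block), "s")
```

Under these assumed numbers the skewed assignment takes 48 s against 16 s for the performance-aware one, which is the kind of gap that motivates choosing the target VM by its observed performance.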
