Abstract

Hadoop is a widely used distributed computing framework for big data processing. Yet Another Resource Negotiator (YARN), introduced in Hadoop 2.0, provides container-based partitioning and allocation of cluster resources to subdivided units of computation. Hadoop YARN, in combination with the Hadoop Distributed File System (HDFS), possesses almost all the characteristics of a distributed operating system. A container is a Java virtual machine launched with a dedicated allocation of memory and CPU shares. When jobs are split into small tasks and scheduled onto containers created dynamically on the nodes of a cluster, the resource management overhead has a significant impact on application execution time, and this overhead depends on the number of component tasks into which a job is split. The work presented in this paper evaluates the resource management overhead in Hadoop YARN clusters. The results help users select an appropriate split level for their jobs, minimizing the overhead and maximizing the performance of distributed applications deployed on the cluster. To evaluate the overhead, MapReduce jobs are run with identical parallelism on an input file of fixed size while the split size is varied. The resource manager overhead is estimated from the variation in application completion time across split levels. A regression model is then developed to estimate the execution time of jobs on a cluster from the size and split size of the input file.
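The abstract itself contains no code; the following is a minimal sketch of how such a split-size experiment could be driven, assuming the standard Hadoop MapReduce Java API. The class name SplitSizeExperiment, the command-line argument layout, and the timing logic are illustrative assumptions rather than details taken from the paper.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SplitSizeExperiment {
        public static void main(String[] args) throws Exception {
            // args: <input path> <output path> <split size in bytes>
            long splitSize = Long.parseLong(args[2]);
            Job job = Job.getInstance(new Configuration(), "split-size-" + splitSize);
            job.setJarByClass(SplitSizeExperiment.class);
            // Pinning both bounds to the same value forces every input split
            // to splitSize bytes (the last split may be smaller), so the
            // number of map tasks is roughly fileSize / splitSize.
            FileInputFormat.setMinInputSplitSize(job, splitSize);
            FileInputFormat.setMaxInputSplitSize(job, splitSize);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            // Mapper and reducer classes for the benchmark workload would be
            // configured here; the identity defaults are enough to exercise
            // container allocation and task scheduling.
            long start = System.currentTimeMillis();
            boolean ok = job.waitForCompletion(true);
            System.out.println("completion time (ms): " + (System.currentTimeMillis() - start));
            System.exit(ok ? 0 : 1);
        }
    }

Running such a driver repeatedly on the same input file with decreasing split sizes yields completion times for increasing task counts; the growth in completion time beyond what the fixed workload explains reflects the resource management overhead, and the resulting (file size, split size, completion time) measurements are the natural input to the regression model described above.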
