Abstract
Hadoop MapReduce is a framework for processing vast amounts of data on a cluster of machines in a reliable and fault-tolerant manner. Since being aware of a job's runtime is crucial to the platform's subsequent scheduling and management decisions, in this paper we propose a new method to estimate the runtime of a job. For this purpose, after precisely analyzing the anatomy of job processing in Hadoop MapReduce, we consider two cases: when a job runs for the first time, and when a job has run previously. In the first case, we formulate each phase of the Hadoop execution pipeline as a mathematical expression, considering the essential parameters that have the greatest impact on runtime, and use these expressions to calculate the runtime of a job. In the second case, the runtime is estimated by referring to the job's profile or history in the database and applying a weighting system. The results show that the average error rate is less than 12% when estimating the runtime of a first run and less than 8.5% when a profile or history of the job exists.
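To illustrate the second case, the sketch below estimates a job's runtime as a recency-weighted average of its recorded runtimes. This is only a minimal illustration: the class name, method names, and decay factor are hypothetical, and the paper's actual weighting system may combine the history differently.

```java
import java.util.List;

/**
 * Illustrative sketch only: estimates the runtime of a previously seen job
 * as a recency-weighted average of its historical runtimes. The weighting
 * scheme shown here (geometric decay) is an assumption, not the paper's.
 */
public class HistoryRuntimeEstimator {

    /**
     * @param historicalRuntimes runtimes (seconds) of past executions, ordered oldest to newest
     * @param decay              weight multiplier per step back in time, e.g. 0.8 (assumed value)
     * @return estimated runtime in seconds
     */
    public static double estimate(List<Double> historicalRuntimes, double decay) {
        double weightedSum = 0.0;
        double weightTotal = 0.0;
        double weight = 1.0; // the most recent run gets the largest weight
        for (int i = historicalRuntimes.size() - 1; i >= 0; i--) {
            weightedSum += weight * historicalRuntimes.get(i);
            weightTotal += weight;
            weight *= decay;
        }
        return weightTotal == 0.0 ? 0.0 : weightedSum / weightTotal;
    }

    public static void main(String[] args) {
        List<Double> history = List.of(120.0, 110.0, 115.0, 105.0);
        System.out.printf("Estimated runtime: %.1f s%n", estimate(history, 0.8));
    }
}
```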
Highlights
Nowadays, with the emergence and use of new systems, we face a massive amount of data.
Since being aware of the runtime of a job is crucial to subsequent decisions, the contribution of this paper is a new method to estimate the runtime of a job in Hadoop MapReduce version 2.
First, we precisely investigate the anatomy of Hadoop and its performance at each stage; we then consider two cases: when a job runs for the first time and no history of it exists, or when a job has run previously and its profile or history is available.
Summary
With the emergence and use of new systems, we face a massive amount of data. One of the well-known open-source frameworks for handling it is Apache Hadoop [1], a scalable and reliable framework for storing and processing big data. Hadoop divides the large input data into fixed-size pieces and stores and processes these splits on a cluster of machines. Hadoop is a data storage and processing platform based on two main concepts: HDFS and MapReduce. HDFS (the Hadoop Distributed File System) is a distributed file system that provides high-throughput access to data, and MapReduce is a framework for the parallel processing of large data sets. Hadoop works in a master/slave style: a Hadoop cluster has one master node and many slave nodes. The Hadoop framework executes a job in a well-defined sequence of processing phases [1,2,3].
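For concreteness, the canonical WordCount job below (adapted from the standard Hadoop MapReduce tutorial example, not taken from the paper) shows the map and reduce phases whose runtimes the proposed model formulates: the input is divided into splits processed by mappers, the intermediate (word, 1) pairs are shuffled by key, and reducers aggregate the counts.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: each mapper processes one input split and emits (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: all values for the same word are shuffled to one reducer and summed.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```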