Abstract
Hadoop MapReduce is a framework for processing vast amounts of data on a cluster of machines in a reliable and fault-tolerant manner. Since being aware of a job's runtime is crucial to the platform's subsequent scheduling and management decisions, in this paper we propose a new method to estimate the runtime of a job. For this purpose, after precisely analyzing the anatomy of job processing in Hadoop MapReduce, we consider two cases: when a job runs for the first time, and when a job has run previously. In the first case, we formulate each phase of the Hadoop execution pipeline as a mathematical expression, considering the essential parameters that have the greatest impact on runtime, and use these expressions to calculate the runtime of a job. In the second case, the runtime is estimated by referring to the job's profile or history in the database and applying a weighting system. The results show that the average error rate is less than 12% when estimating the runtime of a first run and less than 8.5% when a profile or history of the job exists.
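To illustrate the second case, the sketch below estimates a job's runtime as a recency-weighted average of its recorded runtimes. This is only a minimal illustration: the class name, method names, and decay factor are hypothetical, and the paper's actual weighting system may combine the history differently.

```java
import java.util.List;

/**
 * Illustrative sketch only: estimates the runtime of a previously seen job
 * as a recency-weighted average of its historical runtimes. The weighting
 * scheme shown here (geometric decay) is an assumption, not the paper's.
 */
public class HistoryRuntimeEstimator {

    /**
     * @param historicalRuntimes runtimes (seconds) of past executions, ordered oldest to newest
     * @param decay              weight multiplier per step back in time, e.g. 0.8 (assumed value)
     * @return estimated runtime in seconds
     */
    public static double estimate(List<Double> historicalRuntimes, double decay) {
        double weightedSum = 0.0;
        double weightTotal = 0.0;
        double weight = 1.0; // the most recent run gets the largest weight
        for (int i = historicalRuntimes.size() - 1; i >= 0; i--) {
            weightedSum += weight * historicalRuntimes.get(i);
            weightTotal += weight;
            weight *= decay;
        }
        return weightTotal == 0.0 ? 0.0 : weightedSum / weightTotal;
    }

    public static void main(String[] args) {
        List<Double> history = List.of(120.0, 110.0, 115.0, 105.0);
        System.out.printf("Estimated runtime: %.1f s%n", estimate(history, 0.8));
    }
}
```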
Highlights
Nowadays, with the emergence and use of new systems, we face a massive amount of data.
Since being aware of the runtime of a job is crucial to subsequent decisions, the contribution of this paper is a new method to estimate the runtime of a job in Hadoop MapReduce version 2.
First, we precisely investigate the anatomy of Hadoop and its performance at each stage; we then consider two cases: when a job runs for the first time and no history of it exists, or when a job has run previously and its profile or history is available.
Summary
With the emergence and use of new systems, we face a massive amount of data. One of the well-known open-source frameworks for handling it is Apache Hadoop [1], a scalable and reliable framework for storing and processing big data. Hadoop divides the large input data into fixed-size pieces and stores and processes these splits on a cluster of machines. Hadoop is a data storage and processing platform based on two main concepts: HDFS and MapReduce. HDFS (the Hadoop Distributed File System) is a distributed file system that provides high-throughput access to data, and MapReduce is a framework for the parallel processing of large data sets. Hadoop works in a master/slave style: a Hadoop cluster has one master node and many slave nodes. The Hadoop framework executes a job in a well-defined sequence of processing phases [1,2,3].
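For concreteness, the canonical WordCount job below (adapted from the standard Hadoop MapReduce tutorial example, not taken from the paper) shows the map and reduce phases whose runtimes the proposed model formulates: the input is divided into splits processed by mappers, the intermediate (word, 1) pairs are shuffled by key, and reducers aggregate the counts.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: each mapper processes one input split and emits (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: all values for the same word are shuffled to one reducer and summed.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```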