Hadoop Performance Self-Tuning Using a Fuzzy-Prediction Approach

Gil Jae Lee, Jose A B Fortes

doi:10.1109/icac.2016.52

Abstract

The Apache Hadoop framework (currently known as YARN) is a widely used open-source implementation of MapReduce (MR). Manual tuning of Hadoop performance is hard and time-consuming, so several self-tuning approaches have been proposed. This paper proposes an approach that avoids problems of previous self-tuning approaches based on performance models or resource usage, namely 1) the need for a time-consuming training phase, typically offline, 2) unsuitability for Hadoop environments with concurrently running MR jobs, and 3) the need to modify the Hadoop framework itself. The proposed approach uses a fuzzy-prediction controller for self-optimization of the number of concurrent MR jobs. The fuzzy-prediction controller learns from past and current resource usage of MR jobs and from the number of concurrent tasks. It both uses and constructs rules in real time to predict the resource usage and the number of concurrent tasks. It does not require offline training or any modification of either the MR jobs or the Hadoop framework. The predicted values are used to dynamically control the number of concurrent ApplicationMasters (AMs) (i.e., MR jobs in RUNNING state). Experimental evaluation of the proposed approach on a 7-node cluster (1 master node and 6 slave nodes) running 30-job sequences combining three different types of MR jobs (Terasort, Grep and Wordcount) showed up to 29% performance improvement over Hadoop default configurations. The new approach improves the aggregate performance of MR jobs by adjusting a single YARN parameter.
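To make the control loop described in the abstract concrete, the following is a minimal sketch of an online controller that learns simple fuzzy rules from observed resource usage and uses them to pick a limit on concurrent jobs. It is an illustration under stated assumptions, not the authors' algorithm: the fuzzy sets, the utilization target of 0.8, the learning rate, and all names (`FuzzyPredictionController`, `fuzzify`, `slope`) are hypothetical, and the abstract does not specify which YARN parameter is adjusted or how, so the sketch only returns an integer limit.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple

# Hypothetical triangular fuzzy sets over utilization in [0, 1]:
# (left, peak, right) for each label.
USAGE_SETS = {
    "low": (-0.5, 0.0, 0.5),
    "medium": (0.0, 0.5, 1.0),
    "high": (0.5, 1.0, 1.5),
}


def triangular(x, left, peak, right):
    """Membership of x in the triangular fuzzy set (left, peak, right)."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


def fuzzify(usage):
    """Map a utilization value (clamped to [0, 1]) to membership degrees."""
    usage = min(1.0, max(0.0, usage))
    return {name: triangular(usage, *abc) for name, abc in USAGE_SETS.items()}


@dataclass
class FuzzyPredictionController:
    target: float = 0.8   # assumed utilization target
    min_jobs: int = 1
    max_jobs: int = 16
    jobs: int = 1         # current limit on concurrent jobs
    # Rule consequents learned online: estimated usage change per added
    # job, one per fuzzy set. The initial value 0.1 is an arbitrary guess.
    slope: dict = field(default_factory=lambda: {k: 0.1 for k in USAGE_SETS})
    _last: Optional[Tuple[float, int]] = None  # (usage, limit) of previous step

    def predict(self, usage, jobs):
        """Predict utilization if the limit were changed to `jobs`."""
        weights = fuzzify(usage)
        per_job = sum(w * self.slope[k] for k, w in weights.items())
        per_job /= sum(weights.values())
        return usage + per_job * (jobs - self.jobs)

    def step(self, usage):
        """Update rules from the latest observation and return a new limit."""
        if self._last is not None:
            prev_usage, prev_jobs = self._last
            if self.jobs != prev_jobs:
                observed = (usage - prev_usage) / (self.jobs - prev_jobs)
                # Move each rule toward the observation, weighted by how
                # strongly the rule fired on the previous measurement.
                for k, w in fuzzify(prev_usage).items():
                    self.slope[k] += 0.5 * w * (observed - self.slope[k])
        candidates = [j for j in (self.jobs - 1, self.jobs, self.jobs + 1)
                      if self.min_jobs <= j <= self.max_jobs]
        best = min(candidates,
                   key=lambda j: abs(self.predict(usage, j) - self.target))
        self._last = (usage, self.jobs)
        self.jobs = best
        return best


def simulate(steps=20, usage_per_job=0.15):
    """Drive the controller against a toy cluster model (an assumption)."""
    ctrl = FuzzyPredictionController()
    for _ in range(steps):
        ctrl.step(min(1.0, usage_per_job * ctrl.jobs))
    return ctrl
```

Calling `simulate()` runs the controller against a toy model in which utilization grows linearly with the number of concurrent jobs. In a real deployment the measured utilization would come from cluster monitoring, and the returned limit would be applied through whatever YARN setting governs concurrent ApplicationMasters.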

Full Text