Abstract

Hadoop is a widely used implementation framework of the MapReduce programming model for large-scale data processing. Hadoop performance, however, is significantly affected by the settings of the Hadoop configuration parameters. Unfortunately, manually tuning these parameters is very time-consuming, if practical at all. This paper proposes an approach, called RFHOC, to automatically tune the Hadoop configuration parameters for optimized performance of a given application running on a given cluster. RFHOC constructs two ensembles of performance models using a random-forest approach, one for the map stage and one for the reduce stage. Leveraging these models, RFHOC employs a genetic algorithm to automatically search the Hadoop configuration space. The evaluation of RFHOC using five typical Hadoop programs, each with five different input data sets, shows that it achieves a performance speedup of 2.11$\times$ on average and up to 7.4$\times$ over the recently proposed cost-based optimization (CBO) approach. In addition, RFHOC's performance benefit increases with input data set size.
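To illustrate the two-step idea described above, the following is a minimal sketch, not the authors' implementation: it fits random-forest regressors that predict map- and reduce-stage times from a Hadoop configuration vector, then runs a simple genetic algorithm over the configuration space to minimize the predicted job time. The parameter names, ranges, and training data are hypothetical placeholders; in practice the models would be trained on profiled executions of the target application on the target cluster.

```python
# Sketch of the RFHOC-style pipeline: random-forest stage models + GA search.
# All parameter names, ranges, and data below are hypothetical.
import random
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical Hadoop configuration parameters and their search ranges.
PARAM_RANGES = {
    "io.sort.mb": (50, 400),
    "mapreduce.task.io.sort.factor": (10, 100),
    "mapreduce.job.reduces": (1, 64),
}

def random_config():
    return [random.uniform(lo, hi) for lo, hi in PARAM_RANGES.values()]

# Placeholder training data standing in for profiled runs:
# configurations X and measured map/reduce stage times.
X = np.array([random_config() for _ in range(200)])
y_map = np.random.rand(200) * 100
y_reduce = np.random.rand(200) * 100

# Ensemble performance models for the two stages.
map_model = RandomForestRegressor(n_estimators=100).fit(X, y_map)
reduce_model = RandomForestRegressor(n_estimators=100).fit(X, y_reduce)

def predicted_job_time(config):
    c = np.array(config).reshape(1, -1)
    return float(map_model.predict(c)[0] + reduce_model.predict(c)[0])

# Simple genetic algorithm searching the configuration space.
def genetic_search(pop_size=40, generations=50, mutation_rate=0.2):
    population = [random_config() for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=predicted_job_time)
        survivors = population[: pop_size // 2]          # selection
        children = []
        while len(children) < pop_size - len(survivors):
            a, b = random.sample(survivors, 2)
            child = [random.choice(pair) for pair in zip(a, b)]  # crossover
            for i, (lo, hi) in enumerate(PARAM_RANGES.values()):
                if random.random() < mutation_rate:              # mutation
                    child[i] = random.uniform(lo, hi)
            children.append(child)
        population = survivors + children
    return min(population, key=predicted_job_time)

best = genetic_search()
print(dict(zip(PARAM_RANGES, best)), predicted_job_time(best))
```

The key design point carried over from the abstract is that the expensive objective (actually running the Hadoop job) is replaced by cheap random-forest predictions, which makes a population-based search such as a genetic algorithm affordable.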
