Abstract

Apache Hadoop YARN is an open-source framework for local and distributed storage, processing, and analysis of big data on commodity hardware. It provides the MapReduce programming model, HDFS as a distributed file system, and a set of default configuration parameters. The MapReduce programming model exposes Mapper and Reducer interfaces for parallel computation and execution of programs, while HDFS stores data both locally and in a distributed manner. Apache Hadoop ships with more than a hundred default configuration parameters that are common to all types of clusters and applications. Apache Hadoop YARN allows users to customize these parameters to their needs, either through XML configuration files or programmatically in application code, in order to tune the use of resources such as CPU, I/O, memory, and network. Customizing configuration parameters is something of a black art: it requires good knowledge of each parameter and of the impact of changing its default value, because the parameters are interconnected and affect one another's performance. A proper configuration can improve and tune performance, while a misconfiguration can degrade system performance. The challenge is to tune the Apache Hadoop framework through a balanced, customized configuration that neither over-utilizes nor under-utilizes system resources. In this paper we study and analyze different research papers on customizing configuration parameters for performance tuning of Apache Hadoop jobs and better utilization of available resources. We find that a well-chosen custom configuration improves performance compared to the default settings.
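
As a brief illustration of the programmatic route mentioned above, the sketch below overrides a few commonly tuned MapReduce memory and sort parameters through Hadoop's Configuration and Job API. The parameter values shown are illustrative assumptions only, not tuning recommendations from the surveyed papers; equivalent settings could instead be placed in mapred-site.xml.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class TunedJobExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Override a few default parameters programmatically
            // (values below are examples, not recommended settings).
            conf.set("mapreduce.map.memory.mb", "2048");     // container memory per map task
            conf.set("mapreduce.reduce.memory.mb", "4096");  // container memory per reduce task
            conf.set("mapreduce.task.io.sort.mb", "256");    // in-memory sort buffer size
            conf.setInt("mapreduce.job.reduces", 8);         // number of reduce tasks

            Job job = Job.getInstance(conf, "tuned-job");
            // ... set mapper, reducer, and input/output paths as usual ...
        }
    }

Because such parameters interact (for example, the sort buffer must fit inside the map task's container memory), changing one in isolation can shift pressure onto another resource, which is the core difficulty the surveyed work addresses.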
