Optimization of Computing and Networking Resources of a Hadoop Cluster Based on Software Defined Network

Ali Khaleel, Hamed Al-Raweshidy

doi:10.1109/access.2018.2876385

Open Access

Abstract

In this paper, we discuss several challenges of the Hadoop framework. One of the main challenges is the computing performance of Hadoop MapReduce jobs in terms of CPU, memory, and hard disk I/O. The networking side of a Hadoop cluster is another, especially for large-scale clusters with many switches and computing nodes, such as a data center network. The configuration of Hadoop MapReduce parameters can have a significant impact on the computing performance of a Hadoop cluster, and we address the issues relating to these parameter settings. Several significant Hadoop MapReduce parameters are tuned using a novel intelligent technique based on both genetic programming and a genetic algorithm, with the aim of optimizing the performance of a Hadoop MapReduce job. The Hadoop framework has more than 150 configuration parameters; hence, setting them manually is not only difficult but also time-consuming. Consequently, the above-mentioned algorithms are used to search for the optimum parameter settings. A software-defined network (SDN) is also employed to improve the networking performance of a Hadoop cluster, thus accelerating Hadoop jobs. Experiments were carried out on two typical Hadoop applications, Word Count and Tera Sort, using 14 virtual machines in both a traditional network and an SDN. The results for the traditional network show that, for 20 GB of data, our proposed technique improves MapReduce job performance with the Word Count application by 69.63% and 30.31% compared with the default settings and the Gunther work, respectively. For the Tera Sort application, the performance of Hadoop MapReduce is improved by 73.39% and 55.93% compared with the default settings and the Gunther work, respectively.
Moreover, the experimental results in an SDN environment show that the performance of a Hadoop MapReduce job is further improved, owing to the intelligent, centralized management that SDN provides. Another experiment was conducted to evaluate the performance of Hadoop jobs on a large-scale cluster in a data center network, also based on SDN; the results reveal that the SDN-based network outperformed a conventional network.
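
To make the parameter-search idea concrete, the following is a minimal sketch of a genetic algorithm searching over a handful of Hadoop MapReduce parameters. It is not the authors' implementation. The parameter names are real Hadoop settings, but the value ranges, the GA operators (tournament selection, uniform crossover, random-reset mutation, elitism), and in particular `toy_runtime` are assumptions made for illustration. In the paper the objective is built with genetic programming; here a made-up quadratic cost stands in for it.

```python
import random

# A few real Hadoop MapReduce parameter names. The ranges below are
# illustrative assumptions, not values taken from the paper.
PARAM_RANGES = {
    "mapreduce.task.io.sort.mb": (50, 1000),
    "mapreduce.task.io.sort.factor": (10, 200),
    "mapreduce.job.reduces": (1, 64),
    "mapreduce.reduce.shuffle.parallelcopies": (5, 50),
}

# Hypothetical "best" settings used only to give the toy objective a minimum.
TOY_OPTIMUM = {
    "mapreduce.task.io.sort.mb": 400,
    "mapreduce.task.io.sort.factor": 80,
    "mapreduce.job.reduces": 16,
    "mapreduce.reduce.shuffle.parallelcopies": 20,
}


def toy_runtime(cfg):
    """Hypothetical stand-in for a job-runtime model (lower is better)."""
    total = 0.0
    for name, (lo, hi) in PARAM_RANGES.items():
        total += ((cfg[name] - TOY_OPTIMUM[name]) / (hi - lo)) ** 2
    return total


def random_config(rng):
    """Draw one configuration uniformly from the allowed ranges."""
    return {n: rng.randint(lo, hi) for n, (lo, hi) in PARAM_RANGES.items()}


def crossover(a, b, rng):
    """Uniform crossover: each parameter is inherited from either parent."""
    return {n: (a[n] if rng.random() < 0.5 else b[n]) for n in PARAM_RANGES}


def mutate(cfg, rng, rate=0.2):
    """Random-reset mutation: occasionally redraw a parameter from its range."""
    out = dict(cfg)
    for n, (lo, hi) in PARAM_RANGES.items():
        if rng.random() < rate:
            out[n] = rng.randint(lo, hi)
    return out


def tournament(pop, fitness, rng, k=3):
    """Pick the fittest of k randomly sampled individuals."""
    return min(rng.sample(pop, k), key=fitness)


def genetic_search(fitness, generations=40, pop_size=30, seed=0):
    """Return the best configuration found after a fixed number of generations."""
    rng = random.Random(seed)
    pop = [random_config(rng) for _ in range(pop_size)]
    for _ in range(generations):
        children = [min(pop, key=fitness)]  # elitism: carry over the current best
        while len(children) < pop_size:
            parent_a = tournament(pop, fitness, rng)
            parent_b = tournament(pop, fitness, rng)
            children.append(mutate(crossover(parent_a, parent_b, rng), rng))
        pop = children
    return min(pop, key=fitness)


if __name__ == "__main__":
    best = genetic_search(toy_runtime)
    for name, value in best.items():
        print(f"{name} = {value}")
```

In a real setting, `toy_runtime` would be replaced by a measured job duration or a fitted performance model, which makes each fitness evaluation far more expensive and is one reason a learned objective function is attractive.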

Full Text