Performance Study of Spark on YARN Cluster Using HiBench

Hooyoung Ahn, Woongshik You, Hyunjae Kim

doi:10.1109/icce-asia.2018.8552137

Abstract

Recently, various kinds of Internet-of-Things (IoT) solutions and services have been introduced, such as smart industry, smart city, smart factory, and smart agriculture. These solutions and services generate large amounts of data from various devices that are connected through networks and communicate with each other. However, efficiently processing data that is produced rapidly and in massive volume is a difficult problem. To address this problem at the framework level, many open-source big data processing and analysis frameworks are available. To process large-scale data quickly, these frameworks use a cluster consisting of multiple computing machines. However, properly configuring a framework to run on a large-scale cluster is not simple, and it is difficult to verify its performance in a distributed environment. In this paper, we evaluate the performance of Apache Spark, one of the most popular big data processing and analysis frameworks. In particular, we conduct experiments using a representative benchmark tool called HiBench with large-scale data in a cluster environment. From the experimental results, we conclude that Spark is highly scalable for distributed machine learning as well as big data processing.
