Abstract

Considering the recent exponential growth in the volume of information processed by Big Data systems, the high energy consumption of data processing engines in datacenters has become a major issue, underlining the need for efficient resource allocation to achieve more energy-efficient computing. We previously proposed the Best Trade-off Point (BToP) method, a general approach with supporting techniques, based on an algorithm and mathematical formulas, for finding the best trade-off point on an elbow curve of performance versus resources, and applied it to resource provisioning in Hadoop MapReduce. The BToP method is expected to work for any application or system that relies on a trade-off elbow curve, non-inverted or inverted, for making good decisions. In this paper, we apply the BToP method to the emerging cluster computing framework Apache Spark and show that it delivers better performance and lower energy consumption than Spark with its built-in dynamic resource allocation enabled. Our Spark-Bench tests confirm the effectiveness of using the BToP method with Spark to determine the optimal number of executors for any workload in production environments, where job profiling for behavioral replication leads to the most efficient resource provisioning.
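
The paper's own BToP formulas are not reproduced on this page. As a rough illustration of the elbow-curve idea only, the sketch below applies a generic maximum-distance-from-chord knee heuristic to a hypothetical runtime-versus-executors curve; the object `ElbowSketch`, the `kneePoint` function, and the sample data points are all illustrative assumptions, not artifacts of the paper.

```scala
// Illustrative sketch only: a generic knee-point heuristic, NOT the BToP
// algorithm from the paper. It finds the point on a decreasing
// runtime-vs-resources curve farthest from the chord joining its endpoints.
object ElbowSketch {
  // points: (resources, runtime) pairs, e.g. (executor count, job runtime in s)
  def kneePoint(points: Seq[(Double, Double)]): Double = {
    val (x1, y1) = points.head
    val (x2, y2) = points.last
    val chordLen = math.hypot(x2 - x1, y2 - y1)
    // Perpendicular distance from each point to the endpoint chord;
    // the knee is where this distance is largest.
    points.maxBy { case (x, y) =>
      math.abs((y2 - y1) * x - (x2 - x1) * y + x2 * y1 - y2 * x1) / chordLen
    }._1
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical profiling data: runtime (s) vs. number of executors.
    val curve = Seq((1.0, 900.0), (2.0, 480.0), (4.0, 260.0),
                    (8.0, 160.0), (16.0, 130.0), (32.0, 120.0))
    println(s"Knee at ~${kneePoint(curve)} executors") // prints 4.0 here
  }
}
```

Past the knee, each added executor buys little runtime improvement while still consuming energy, which is the trade-off the BToP method exploits.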

Highlights

  • The Gartner, Inc. research firm has forecast that the rapidly growing cloud ecosystem will have up to 25 billion IoT sensor devices connected by 2020 [1].

  • Although there is an abundance of ongoing development and research work improving many features of Spark, we focus on the dynamic resource allocation mechanism in this paper, since it is the main feature directly related to the objective of the Best Trade-off Point (BToP) method (see the configuration sketch after this list).

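For reference, the baseline that BToP is compared against is Spark's built-in dynamic resource allocation. The sketch below shows a minimal, hypothetical way to enable it through Spark's documented spark.dynamicAllocation.* properties; the application name and min/max executor values are illustrative, not the paper's experimental settings.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch of the comparison baseline: Spark's built-in dynamic
// resource allocation. Property keys are Spark's documented settings;
// the values chosen here are illustrative assumptions.
object DynamicAllocationBaseline {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dynamic-allocation-baseline")
      .config("spark.dynamicAllocation.enabled", "true")
      // An external shuffle service lets executors be released safely
      // while their shuffle output remains available.
      .config("spark.shuffle.service.enabled", "true")
      .config("spark.dynamicAllocation.minExecutors", "2")
      .config("spark.dynamicAllocation.maxExecutors", "64")
      .getOrCreate()

    // ... run the workload; Spark scales executors between min and max ...
    spark.stop()
  }
}
```

Under the BToP approach, one would instead profile the workload, pick the executor count at the knee of its runtime-versus-executors curve, and pin it with spark.executor.instances (or spark-submit --num-executors), leaving dynamic allocation disabled.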

Introduction

The Gartner, Inc. research firm has forecast that the rapidly growing cloud ecosystem will have up to 25 billion IoT sensor devices connected by 2020 [1]. This large number of devices will generate hundreds of zettabytes of information in the cloud to be analyzed by Big Data processing engines, such as Hadoop MapReduce and Apache Spark, to deliver practical value in business, technology, and manufacturing processes for better innovation and more intelligent decisions. Datacenter electricity consumption is projected to increase to roughly 140 billion kilowatt-hours annually by 2020, the equivalent annual output of 50 power plants, costing American businesses $13 billion annually in electricity bills and emitting nearly 100 million metric tons of carbon pollution per year [4]. This energy expense, largely incurred for Big Data processing, could be reduced through more efficient resource provisioning in MapReduce and Spark, among other data processing frameworks.
