Abstract

Big Data processing systems (e.g., Spark) have a number of resource configuration parameters, such as memory size, CPU allocation, and the number of running nodes. Regular users and even expert administrators struggle to understand the relationship between different parameter configurations and the overall performance of the system. In this paper, we address this challenge by proposing a performance prediction framework, called d-Simplexed, to build performance models over the configurable parameters of Spark. Taking inspiration from the field of Computational Geometry, we construct a d-dimensional mesh using Delaunay Triangulation over a selected set of features, and from this mesh we predict the execution time of unseen feature configurations. To minimize the time and resources spent building a bootstrap model over a large space of configuration values, we propose an adaptive sampling technique that collects only as many training points as required. Our evaluation on a cluster of computers using the WordCount, PageRank, Kmeans, and Join workloads of the HiBench benchmark suite shows that we achieve an estimation error of less than 5% while sampling less than 1% of the data.

Highlights

  • Numerous Big Data frameworks have been introduced to address the problem of organizing large-scale fault-tolerant computation in a clustered environment

  • We propose a framework, called d-Simplexed, that uses a Delaunay Triangulation (DT) model to predict performance for a given parameter configuration, together with heuristic adaptive sampling to reduce the number of samples needed for training

  • We introduce the following main steps to build and use the DT model for performance modeling and prediction (sketched in the code below): 1) Triangulation: given a set of d features {f1, f2, ..., fd} with concrete values such as {16 GB, 4 vcores}, we build a Delaunay Triangulation model in R^d space; 2) Projection: for each d-simplex returned by the Delaunay Triangulation, we use the running times of its (d + 1) vertices to compute a hyperplane; 3) Prediction: given a new parameter configuration, we predict its running time from the model constructed in the previous steps
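
Below is a minimal sketch of these three steps in Python, using SciPy's Delaunay triangulation. The configuration points, the measured runtimes, and the predict helper are illustrative placeholders, not the paper's actual implementation.

    import numpy as np
    from scipy.spatial import Delaunay

    # 1) Triangulation: configurations as points in R^d, e.g., (memory in GB, vcores).
    configs = np.array([[4, 1], [4, 8], [16, 1], [16, 8], [8, 4]], dtype=float)
    runtimes = np.array([310.0, 120.0, 240.0, 95.0, 150.0])  # measured seconds (made-up values)
    tri = Delaunay(configs)

    def predict(query):
        """2) Projection + 3) Prediction: locate the enclosing d-simplex and evaluate
        the hyperplane through its (d + 1) vertices at the query point."""
        query = np.asarray(query, dtype=float)
        simplex = int(tri.find_simplex(query[None, :])[0])
        if simplex < 0:
            raise ValueError("query lies outside the triangulated configuration space")
        vertices = tri.simplices[simplex]
        # Barycentric coordinates of the query inside the simplex ...
        T = tri.transform[simplex]
        bary = T[:-1].dot(query - T[-1])
        weights = np.append(bary, 1.0 - bary.sum())
        # ... give the linear interpolation of the vertices' running times.
        return float(weights.dot(runtimes[vertices]))

    print(predict([12, 4]))  # predicted running time for an unseen configuration

Interpolating with barycentric weights inside a simplex is equivalent to evaluating the hyperplane fitted through that simplex's vertices; SciPy's LinearNDInterpolator wraps the same computation if a ready-made routine is preferred.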


Summary

INTRODUCTION

Alvaro [16], OtterTune [1], and CDBTune [42] use several regressors to tune a set of parameters. They train their models to maximize a single objective, i.e., to predict a locally optimal performance point. Modeling the whole performance topography is expensive when the parameter space is large, and randomly chosen samples do not guarantee the desired accuracy. Determining both the right fraction and the appropriate representatives of the samples for building a model is not trivial.
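
As a rough illustration of choosing sample representatives adaptively rather than at random, the sketch below iteratively measures the configuration where a piecewise-linear Delaunay model looks least reliable. This is a heuristic under our own assumptions (centroid refinement of the simplex with the largest runtime spread), not necessarily the exact d-Simplexed strategy, and run_workload is a placeholder for an actual Spark run.

    import numpy as np
    from scipy.spatial import Delaunay

    def run_workload(config):
        """Placeholder: launch the Spark job with this configuration and return its runtime in seconds."""
        raise NotImplementedError

    def adaptive_sample(corner_configs, budget):
        """Start from the corner configurations of the parameter space; at each step,
        measure the centroid of the simplex whose vertex runtimes vary the most,
        i.e., where a single hyperplane is least likely to fit well."""
        configs = [np.asarray(c, dtype=float) for c in corner_configs]
        times = [run_workload(c) for c in configs]
        while len(configs) < budget:
            pts, ys = np.array(configs), np.array(times)
            tri = Delaunay(pts)
            # Runtime spread across each simplex's vertices as an uncertainty score.
            spread = ys[tri.simplices].max(axis=1) - ys[tri.simplices].min(axis=1)
            worst = int(np.argmax(spread))
            configs.append(pts[tri.simplices[worst]].mean(axis=0))
            times.append(run_workload(configs[-1]))
        return np.array(configs), np.array(times)

The loop stops once the measurement budget is exhausted; in practice one could also stop early when the model's error on a small held-out set drops below a target threshold.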

Spark Preliminaries
Delaunay Triangulation Primitives
PROBLEM STATEMENT
Runtime Modeling
DELAUNAY TRIANGULATION
Result
Modeling
Prediction
ADAPTIVE SAMPLING
EMPIRICAL EVALUATION
Experiment Design
Experiment Setting
Overview of Workload Evaluation
Model Evaluation
Sampling Evaluation
More Evaluation Results
Evaluation Summary
RELATED WORK
CONCLUSIONS AND FUTURE WORK