Abstract

Evaluating and predicting the performance of big data applications is required to size capacities efficiently and to manage operations. However, gaining profound insights into the system architecture, component dependencies, resource demands, and configurations is difficult for engineers. To address these challenges, this paper presents an approach to automatically extract and transform system specifications to predict the performance of applications. It consists of three components. First, a system- and tool-agnostic domain-specific language (DSL) allows the modeling of performance-relevant factors of big data applications, computing resources, and data workload. Second, DSL instances are automatically extracted from monitored measurements of Apache Spark and Apache Hadoop (i.e., YARN and HDFS) systems. Third, these instances are transformed to model- and simulation-based performance evaluation tools to allow predictions. By adapting DSL instances, our approach enables engineers to predict the performance of applications for different scenarios, such as changing data input and resources. We evaluate our approach by predicting the performance of linear regression and random forest applications of the HiBench benchmark suite. Comparing simulation results of adjusted DSL instances with measurement results shows accurate predictions, with errors below 15% for average response times and resource utilization.

Highlights

  • Big data frameworks are specialized to analyze data with high volume, variety, and velocity efficiently [1]

  • An execution node ne ∈ NE is a 5-tuple ne = (pn, s, m, nng, rp), where pn is the parallelism of the node; s indicates whether ne is a spout, i.e., a node that depends on partitioned data from an external source such as a file system or messaging system; m ∈ M is a reference to the dependent data model from the Data Workload Architecture; nng ∈ NG references the parent directed node graph; and rp ∈ RP describes the Resource Profile of ne (see the sketch after this list)

  • Modeling and predicting the performance of big data applications are essential for planning capacities and evaluating configurations
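
To make the tuple structure above concrete, the following is a minimal sketch of an execution node as a plain data structure. The class and field names are illustrative assumptions, not the paper's DSL metamodel (which is defined in Ecore); they merely mirror the five elements pn, s, m, nng, and rp.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ExecutionNode:
    """Illustrative 5-tuple ne = (pn, s, m, nng, rp) of an execution node."""
    parallelism: int                 # pn: degree of parallelism of the node
    is_spout: bool                   # s: node reads partitioned data from an external source
    data_model: Optional[str]        # m: reference into the Data Workload Architecture
    parent_graph: Optional[str]      # nng: parent directed node graph
    resource_profile: Optional[str]  # rp: Resource Profile of the node

# Example: a spout-like node reading partitioned input with parallelism 8
# (all identifiers below are hypothetical).
source_node = ExecutionNode(
    parallelism=8,
    is_spout=True,
    data_model="hibench_input_data",
    parent_graph="spark_job_graph",
    resource_profile="map_stage_profile",
)
```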


Summary

Introduction

Big data frameworks are specialized to analyze data with high volume, variety, and velocity efficiently [1]. We extract monitoring traces of applications (i.e., CPU times) and interrelate these with data workload information to identify parametric dependencies and estimate parametric resource demands of each execution component. On this basis, performance predictions are enabled. As applications are continuously updated, DSL instances can be extracted and tracked for each release as they evolve. This enables engineers to continuously manage and plan required capacities and to evaluate the performance for different scenarios (e.g., changing data workload) by adapting model parameters. It gives detailed insights into the resource demands of an application's execution components and can be used to detect performance changes and regressions.
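As an illustration of this resource-demand estimation step, the following sketch fits a simple linear dependency between input data size and measured CPU time for one execution component. It is a minimal example under assumed, hypothetical monitoring samples and uses a plain least-squares fit; the actual extraction in the paper derives parametric resource demands from Spark and YARN monitoring traces.

```python
import numpy as np

# Hypothetical monitoring samples for one execution component:
# input data size (MB) and measured CPU time (ms).
data_size_mb = np.array([128, 256, 512, 1024, 2048])
cpu_time_ms = np.array([310, 590, 1180, 2300, 4650])

# Fit a linear parametric resource demand: cpu_time ≈ a * data_size + b.
a, b = np.polyfit(data_size_mb, cpu_time_ms, deg=1)

def predicted_cpu_time(size_mb: float) -> float:
    """Predict the CPU demand of this component for a given input size."""
    return a * size_mb + b

# Predict the demand for a changed data workload (e.g., 4 GB input).
print(f"Estimated CPU time for 4096 MB: {predicted_cpu_time(4096):.0f} ms")
```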

Related Work
Formalism
Application Execution Architecture
Resource Profile
Data Workload Architecture
Resource Architecture
PerTract-DSL
[DSL metamodel excerpt: a Configuration element with attributes defaultParallelism: EInt, executors: EInt, and taskSlotsPerExecutor, plus a reference spec to a ClusterSpecification; see the sketch after this outline]
Extraction of Resource Demands
Extraction and Estimation of Resource Profiles
Extraction of Data Workload Architectures
Extraction of Resource Architectures
Palladio Component Model
Transformation to PCM
Research Methodology
HiBench Benchmark Suite
Experiment Setup
Collecting Resource Demands and Extracting Execution Architectures
Evaluating Data Workload Changes
Evaluating Resource Changes
Evaluating Data Workload and Resource Changes
Threats to Validity
Assumptions and Limitations
Conclusions and Future Work
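
As a rough illustration of the configuration excerpt noted in the outline above, the following sketch models such a configuration object in plain code. The field names follow the excerpt (defaultParallelism, executors, taskSlotsPerExecutor, spec); the field types and the ClusterSpecification attributes are assumptions, since the actual PerTract-DSL is defined as an Ecore metamodel rather than in Python.

```python
from dataclasses import dataclass

@dataclass
class ClusterSpecification:
    """Assumed placeholder for the referenced cluster specification."""
    nodes: int
    cores_per_node: int
    memory_per_node_gb: int

@dataclass
class Configuration:
    """Illustrative counterpart of the DSL's Configuration element."""
    defaultParallelism: int       # EInt in the metamodel excerpt
    executors: int                # EInt in the metamodel excerpt
    taskSlotsPerExecutor: int     # type not shown in the excerpt; int assumed
    spec: ClusterSpecification    # reference to the ClusterSpecification

# Example: a small Spark-like setup with 4 executors and 2 task slots each.
config = Configuration(
    defaultParallelism=8,
    executors=4,
    taskSlotsPerExecutor=2,
    spec=ClusterSpecification(nodes=4, cores_per_node=8, memory_per_node_gb=32),
)
```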
