Abstract
Evaluating and predicting the performance of big data applications is required to size capacities efficiently and manage operations. However, gaining the necessary insights into the system architecture, component dependencies, resource demands, and configurations is difficult for engineers. To address these challenges, this paper presents an approach that automatically extracts and transforms system specifications to predict the performance of applications. It consists of three components. First, a system- and tool-agnostic domain-specific language (DSL) allows the modeling of performance-relevant factors of big data applications, computing resources, and data workload. Second, DSL instances are automatically extracted from monitored measurements of Apache Spark and Apache Hadoop (i.e., YARN and HDFS) systems. Third, these instances are transformed into model- and simulation-based performance evaluation tools to enable predictions. By adapting DSL instances, our approach enables engineers to predict the performance of applications for different scenarios, such as changing data input and resources. We evaluate our approach by predicting the performance of linear regression and random forest applications from the HiBench benchmark suite. Comparing simulation results of adjusted DSL instances to measurements shows accurate predictions, with errors below 15% on average for response times and resource utilization.
Highlights
Big data frameworks are specialized to analyze data with high volume, variety, and velocity efficiently [1]
An execution node ne ∈ NE is a 5-tuple where pn is the parallelism of the node; s indicates whether ne is a spout, i.e., a node that depends on partitioned data from an external source such as a file system or messaging system; m ∈ M is a reference to the dependent data model from the Data Workload Architecture; ng ∈ NG references the parent directed node graph; and rp ∈ RP describes the Resource Profile of ne.
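The 5-tuple above can be sketched as a plain data structure. This is a minimal illustration, not the paper's DSL metamodel; all class and field names are assumptions chosen to mirror the tuple members (pn, s, m, ng, rp).

```python
from dataclasses import dataclass

@dataclass
class ResourceProfile:
    """Hypothetical resource profile: parametric CPU demand per record."""
    cpu_ms_per_record: float

@dataclass
class ExecutionNode:
    """Sketch of the 5-tuple ne = (pn, s, m, ng, rp)."""
    pn: int              # parallelism of the node
    s: bool              # True if ne is a spout reading partitioned external data
    m: str               # reference to the dependent data model (Data Workload Architecture)
    ng: str              # reference to the parent directed node graph
    rp: ResourceProfile  # Resource Profile of ne

# Example instance: a spout with parallelism 8 reading a partitioned data model.
node = ExecutionNode(pn=8, s=True, m="input_records",
                     ng="application_graph", rp=ResourceProfile(0.42))
```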
Modeling and predicting the performance of big data applications are essential for planning capacities and evaluating configurations
Summary
Big data frameworks are specialized to analyze data with high volume, variety, and velocity efficiently [1]. We extract monitoring traces of applications (i.e., CPU times) and interrelate these with data workload information to identify parametric dependencies and estimate parametric resource demands of each execution component. On this basis, performance predictions are enabled. As applications are continuously updated, DSL instances can be extracted and tracked for each release as they evolve. This enables engineers to continuously manage and plan required capacities and to evaluate performance for different scenarios (e.g., changing data workload) by adapting model parameters. It gives detailed insights into the resource demands of an application's execution components and can be used to detect performance changes and regressions.
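The idea of interrelating monitored CPU times with workload information to estimate a parametric resource demand can be sketched with an ordinary least-squares fit. This is an illustrative stand-in, not the paper's estimator; the sample traces and the linear model cpu_time ≈ a · input_records + b are assumptions.

```python
def fit_linear_demand(samples):
    """Ordinary least squares over (input_records, cpu_time_ms) pairs.

    Returns (a, b) such that cpu_time ≈ a * input_records + b,
    i.e., a parametric resource demand in the input size.
    """
    n = len(samples)
    sx = sum(x for x, _ in samples)
    sy = sum(y for _, y in samples)
    sxx = sum(x * x for x, _ in samples)
    sxy = sum(x * y for x, y in samples)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

# Hypothetical monitoring traces: (records processed, measured CPU time in ms).
traces = [(1_000, 120.0), (2_000, 235.0), (4_000, 480.0)]
a, b = fit_linear_demand(traces)
# Predict the demand for a larger, unmeasured workload (8,000 records).
predicted_ms = a * 8_000 + b
```

Once such a dependency is calibrated per execution component, adapting the workload parameter in the model yields predictions for unseen scenarios without re-running the application.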