Collaborative Cluster Configuration for Distributed Data-Parallel Processing: A Research Overview

Lauritz Thamsen, Odej Kao, Dominik Scheinert, Jonathan Bader, Jonathan Will

doi:10.1007/s13222-022-00416-z


Open Access

https://doi.org/10.1007/s13222-022-00416-z


Abstract

Many organizations routinely analyze large datasets using systems for distributed data-parallel processing and clusters of commodity resources. Yet, users need to configure adequate resources for their data processing jobs. This requires significant insights into expected job runtimes and scaling behavior, resource characteristics, input data distributions, and other factors. Unable to estimate performance accurately, users frequently overprovision resources for their jobs, leading to low resource utilization and high costs.

In this paper, we present major building blocks towards a collaborative approach for optimization of data processing cluster configurations based on runtime data and performance models. We believe that runtime data can be shared and used for performance models across different execution contexts, significantly reducing the reliance on the recurrence of individual processing jobs or, else, dedicated job profiling. For this, we describe how the similarity of processing jobs and cluster infrastructures can be employed to combine suitable data points from local and global job executions into accurate performance models. Furthermore, we outline approaches to performance prediction via more context-aware and reusable models. Finally, we lay out how metrics from previous executions can be combined with runtime monitoring to effectively re-configure models and clusters dynamically.
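As a rough illustration of the idea of combining local and shared runtime data by similarity, the sketch below fits a small runtime model with similarity-weighted least squares and then picks the smallest scale-out that meets a runtime target. It is not the authors' method: the model form `runtime = a + b / scale_out`, the `similarity` scores, and all names (`Run`, `fit_weighted`, `predict`, `smallest_scale_out`) are assumptions made for this example only.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Run:
    scale_out: int     # number of worker nodes used in a previous execution
    runtime_s: float   # measured runtime in seconds
    similarity: float  # hypothetical weight in (0, 1]: how closely the run's
                       # context matches the target job and infrastructure


def fit_weighted(runs: List[Run]) -> Tuple[float, float]:
    """Weighted least squares for the assumed model runtime = a + b / scale_out."""
    w = [r.similarity for r in runs]
    x = [1.0 / r.scale_out for r in runs]
    y = [r.runtime_s for r in runs]
    total = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    sxx = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx  # requires at least two distinct scale-outs
    a = mean_y - b * mean_x
    return a, b


def predict(model: Tuple[float, float], scale_out: int) -> float:
    """Predicted runtime in seconds for a given scale-out."""
    a, b = model
    return a + b / scale_out


def smallest_scale_out(model: Tuple[float, float],
                       candidates: List[int],
                       max_runtime_s: float) -> Optional[int]:
    """Smallest candidate scale-out whose predicted runtime meets the target."""
    feasible = [n for n in candidates if predict(model, n) <= max_runtime_s]
    return min(feasible) if feasible else None


# Local runs get full weight; runs shared from other contexts are down-weighted.
runs = [
    Run(scale_out=2, runtime_s=700.0, similarity=1.0),  # local
    Run(scale_out=4, runtime_s=400.0, similarity=0.6),  # shared, similar context
    Run(scale_out=8, runtime_s=250.0, similarity=0.3),  # shared, less similar
]
model = fit_weighted(runs)
choice = smallest_scale_out(model, [2, 4, 6, 8, 12], max_runtime_s=320.0)
```

Weighting by similarity lets data from other execution contexts contribute without dominating the fit; how such similarity would actually be computed is left open here.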


Journal: Datenbank-Spektrum
Publication Date: May 31, 2022
License type: open-access

Similar Papers

A dynamic model for the distributed simulation of a turbojet engine
C Tournes ... B.E Wells
08 Mar 1998

Queueing-based storage performance modeling and placement in OpenStack environments
Yang Song ... Rakesh Jain
01 Dec 2014

Training Data Reduction for Performance Models of Data Analytics Jobs in the Cloud
Jonathan Will ... Lauritz Thamsen
15 Dec 2021

Power/Performance Modeling and Optimization: Using and Characterizing Machine Learning Applications
17 Oct 2018
