Abstract

Distributed dataflow systems such as Spark and Flink enable users to analyze large datasets. Users create programs by providing sequential user-defined functions for a set of well-defined operations and by selecting a set of resources; the systems then automatically distribute the jobs across these resources. However, selecting resources for specific performance needs is inherently difficult, and users consequently tend to overprovision, which results in poor cluster utilization. At the same time, many important jobs are executed recurrently in production clusters. This paper presents Bell, a practical system that monitors job execution, models the scale-out behavior of jobs based on previous runs, and selects resources according to user-provided runtime targets. Bell automatically chooses between different runtime prediction models to support different distributed dataflow systems well. Bell is implemented as a job submission tool for YARN and thus works with existing cluster setups. We evaluated Bell's runtime prediction with six exemplary data analytics jobs using both Spark and Flink. We present the learned scale-out models for these jobs and evaluate the relative prediction error using cross-validation, showing that our model selection approach provides better overall performance than the individual prediction models.
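To make the idea concrete, the sketch below illustrates the kind of model selection the abstract describes: fit two candidate scale-out models on a job's previous runs, pick the one with the lower cross-validated relative prediction error, and then choose the smallest scale-out whose predicted runtime meets a user-provided target. The specific model forms (an Ernest-style parametric model with 1/x and log x features, and interpolation as the nonparametric alternative), the function names, and the example numbers are illustrative assumptions, not the paper's exact implementation.

```python
# Hypothetical sketch of Bell-style scale-out model selection.
# Assumptions (not from the paper): the two concrete model forms below
# and the leave-one-out error criterion; Bell's actual models may differ.
import numpy as np

def parametric_features(x):
    # Scale-out features: constant term, 1/x, and log(x) for x nodes
    x = np.asarray(x, dtype=float)
    return np.column_stack([np.ones_like(x), 1.0 / x, np.log(x)])

def fit_parametric(x, y):
    # Least-squares fit of runtime against the scale-out features
    theta, *_ = np.linalg.lstsq(parametric_features(x), np.asarray(y, float),
                                rcond=None)
    return lambda q: parametric_features(q) @ theta

def fit_nonparametric(x, y):
    # Piecewise-linear interpolation between observed scale-outs
    # (np.interp clamps outside the observed range, which is crude
    # but sufficient for a sketch)
    order = np.argsort(x)
    xs = np.asarray(x, float)[order]
    ys = np.asarray(y, float)[order]
    return lambda q: np.interp(np.asarray(q, float), xs, ys)

def loo_relative_error(fit, x, y):
    # Leave-one-out cross-validation: mean relative prediction error
    x, y = np.asarray(x, float), np.asarray(y, float)
    errs = []
    for i in range(len(x)):
        mask = np.arange(len(x)) != i
        model = fit(x[mask], y[mask])
        pred = model(np.array([x[i]]))[0]
        errs.append(abs(pred - y[i]) / y[i])
    return float(np.mean(errs))

def select_scale_out(x_runs, y_runs, target_seconds, candidates):
    # Pick the model with the lower cross-validated error, then the
    # smallest candidate scale-out predicted to meet the runtime target
    fits = {"parametric": fit_parametric, "nonparametric": fit_nonparametric}
    errors = {name: loo_relative_error(f, x_runs, y_runs)
              for name, f in fits.items()}
    best = min(errors, key=errors.get)
    model = fits[best](x_runs, y_runs)
    preds = model(np.asarray(list(candidates), float))
    feasible = [c for c, p in zip(candidates, preds) if p <= target_seconds]
    return best, (min(feasible) if feasible else None)

# Example with made-up previous runs of a recurring job:
# (number of nodes, observed runtime in seconds)
nodes = [4, 8, 12, 16, 20]
runtimes = [620.0, 340.0, 250.0, 210.0, 190.0]
chosen_model, scale_out = select_scale_out(nodes, runtimes,
                                           target_seconds=300.0,
                                           candidates=range(2, 41))
print(chosen_model, scale_out)
```

Choosing between a parametric and a nonparametric model per job is what lets such a tool serve different dataflow systems: where a job's runtime follows the assumed parametric shape, that model extrapolates well from few runs, and where it does not, the nonparametric model avoids a misspecified fit.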
