Abstract

Distributed dataflow systems such as Apache Spark allow iterative programs to be executed at large scale on clusters. In production use, programs are often recurring and have strict latency requirements. Yet choosing appropriate resource allocations is difficult, as runtimes depend on hard-to-predict factors, including failures, cluster utilization, and dataset characteristics. Offline runtime prediction helps to estimate resource requirements, but it cannot account for inherent variance due to, for example, changing cluster states. We present SMiPE, a system that estimates the progress of iterative dataflows by matching a running job to previous executions based on similarity, capturing properties such as convergence, hardware utilization, and runtime. Owing to its black-box approach, SMiPE is not limited to a specific framework, and it adapts to changing cluster states reflected in the current job’s statistics. SMiPE automatically adapts its similarity matching to algorithm-specific profiles by training parameters on the job history. We evaluated SMiPE with three iterative Spark jobs and nine datasets. The results show that SMiPE is effective in choosing useful historic runs and predicts runtimes with a mean relative error of 9.1% to 13.1%.
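
To make the core idea concrete, below is a minimal, hypothetical sketch of similarity-weighted runtime estimation in the spirit the abstract describes: the statistics of a running job observed so far are compared against the matching prefix of each historic run, and historic total runtimes are averaged with weights given by their similarity scores. All names (HistoricRun, similarity, estimate_runtime) and the exponential similarity function are illustrative assumptions, not the paper's actual method or trained parameters.

# Hypothetical sketch of similarity-based runtime estimation; not SMiPE's
# actual algorithm, which trains its matching parameters on the job history.
from dataclasses import dataclass
from typing import List
import math

@dataclass
class HistoricRun:
    iteration_times: List[float]   # per-iteration runtimes (seconds)
    total_runtime: float           # observed total runtime (seconds)

def similarity(current: List[float], historic: List[float]) -> float:
    """Compare the iterations observed so far against the same prefix of a
    historic run; smaller mean relative deviation -> similarity closer to 1."""
    n = min(len(current), len(historic))
    if n == 0:
        return 0.0
    rel_err = sum(abs(c - h) / max(h, 1e-9)
                  for c, h in zip(current[:n], historic[:n])) / n
    return math.exp(-rel_err)      # map deviation into a (0, 1] score

def estimate_runtime(current: List[float], history: List[HistoricRun]) -> float:
    """Predict total runtime as the similarity-weighted average of the
    total runtimes of previous executions."""
    weights = [similarity(current, run.iteration_times) for run in history]
    total_w = sum(weights)
    if total_w == 0:
        raise ValueError("no comparable historic runs")
    return sum(w * run.total_runtime
               for w, run in zip(weights, history)) / total_w

# Example: two historic runs of the same algorithm; the current job has
# completed three iterations so far.
history = [
    HistoricRun([10.0, 9.0, 8.5, 8.0, 7.8], total_runtime=43.3),
    HistoricRun([20.0, 18.0, 17.0, 16.5, 16.0], total_runtime=87.5),
]
running = [11.0, 9.5, 8.7]
print(f"estimated total runtime: {estimate_runtime(running, history):.1f}s")

In this toy example the current job tracks the faster historic run closely, so that run dominates the weighted average; the same mechanism lets a prediction shift toward slower historic runs when the current job's statistics reflect a more loaded cluster.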
