A rewrite-based optimizer for Spark

Zeinab Shmeis, Mohamad Jaber

doi:10.1016/j.future.2019.03.044

Abstract

Spark is the leading platform for distributed large-scale data processing. Spark’s Application Programming Interface (API) offers powerful, easy-to-use distributed abstractions that resemble functional programming operators (e.g., map, filter, reduce) and is available in several languages. However, writing efficient Spark applications is still error-prone and time-consuming, and it requires a clear and deep understanding of the inner workings of Spark. For instance, the same task can be implemented in several different ways, yet the execution time can vary drastically among them. To address this, we introduce TaBOS, a rewrite-based optimizer for Spark programs. TaBOS takes a Spark job and automatically generates a state space of equivalent optimized jobs using a set of semantics-preserving rewrite rules. Then, from the generated state space, it selects one optimal program based on a predefined strategy. We introduce several selection strategies (e.g., the job with the maximum number of applied rewrite rules, or the job with the minimum number of heavy operations) for identifying an optimal job in the generated state space. We evaluate the effectiveness, robustness, and speedup gain of our solutions using several case studies.
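The generate-then-select idea described above can be illustrated with a minimal sketch. It is not the TaBOS implementation: the job representation (a tuple of operator names), the two rewrite rules, the `sumValues` placeholder operator, and the set of "heavy" operators are all assumptions chosen for illustration, and no Spark installation is required to run it.

```python
from collections import deque

# Assumed set of "heavy" (shuffle-inducing) operators; illustrative only.
HEAVY_OPS = {"groupByKey", "reduceByKey", "sortByKey"}


def replace_pattern(job, pattern, replacement):
    """Return every job obtained by replacing one occurrence of `pattern`."""
    n = len(pattern)
    return [
        job[:i] + replacement + job[i + n:]
        for i in range(len(job) - n + 1)
        if job[i:i + n] == pattern
    ]


# Hypothetical rewrite rules, each mapping a job to its one-step rewrites.
RULES = [
    # Grouping and then summing per key can be done by a single reduceByKey.
    lambda job: replace_pattern(job, ("groupByKey", "sumValues"), ("reduceByKey",)),
    # Sorting does not change the number of records, so it is useless before count.
    lambda job: replace_pattern(job, ("sortByKey", "count"), ("count",)),
]


def generate_state_space(job):
    """Breadth-first closure of `job` under RULES.

    Maps each reachable job to the number of rewrites on the first path found.
    Terminates here because every rule strictly shortens the job.
    """
    seen = {job: 0}
    queue = deque([job])
    while queue:
        current = queue.popleft()
        for rule in RULES:
            for rewritten in rule(current):
                if rewritten not in seen:
                    seen[rewritten] = seen[current] + 1
                    queue.append(rewritten)
    return seen


def heavy_count(job):
    """Number of operators in the job that belong to HEAVY_OPS."""
    return sum(op in HEAVY_OPS for op in job)


def select_min_heavy(space):
    """Strategy: fewest heavy operators; ties go to the job with more rewrites."""
    return min(space, key=lambda j: (heavy_count(j), -space[j]))


def select_max_rewrites(space):
    """Strategy: the job reached with the largest number of applied rewrites."""
    return max(space, key=lambda j: space[j])


job = ("map", "groupByKey", "sumValues", "filter", "sortByKey", "count")
space = generate_state_space(job)
best = select_min_heavy(space)
```

In this toy setting both strategies select the same job, `("map", "reduceByKey", "filter", "count")`. With a richer rule set the strategies can disagree, which is why the choice of strategy matters.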

Full Text