Abstract

Distributed data processing systems have become the standard means for big data analytics. These systems are based on processing pipelines in which operations on data are performed in a chain of consecutive steps. Normally, the operations performed by these pipelines are fixed at design time, and any change to their functionality requires the application to be restarted. This is not always acceptable, for example, when we cannot afford downtime or when a long-running calculation would lose significant progress. The introduction of variation points into distributed processing pipelines allows individual analysis steps to be updated on the fly. In this paper, we extend this basic variation-point functionality to provide fully automated reconfiguration of the processing steps within a running pipeline through an automated planner. We have enabled pipeline modeling through constraints. Based on these constraints, we not only ensure that configurations are type-compatible but also verify that the expected pipeline functionality is achieved. Furthermore, automating the reconfiguration process simplifies its use, in turn allowing users with less development experience to make changes. The system can automatically generate and validate pipeline configurations that achieve a specified goal, selecting from the operation definitions available at planning time. It then automatically integrates these configurations into the running pipeline. We verify the system by testing a proof-of-concept implementation. The proof of concept also shows promising results when reconfiguration is performed frequently.
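
The variation-point mechanism can be pictured as a pipeline step whose operation sits behind a swappable, thread-safe reference. The following minimal sketch is illustrative only; the class and method names are invented here and do not reflect the framework's actual API.

    # Hypothetical sketch of a variation point: a pipeline step whose
    # operation can be replaced at runtime without restarting the pipeline.
    import threading
    from typing import Callable

    class VariationPoint:
        def __init__(self, operation: Callable):
            self._operation = operation
            self._lock = threading.Lock()

        def apply(self, record):
            with self._lock:              # read a consistent operation
                op = self._operation
            return op(record)

        def update(self, new_operation: Callable):
            with self._lock:              # swap the step's user code on the fly
                self._operation = new_operation

    # The running pipeline keeps calling step.apply(...) while a
    # reconfiguration thread may call step.update(...) at any time.
    step = VariationPoint(lambda x: x * 2)
    print(step.apply(21))   # 42
    step.update(lambda x: x + 1)
    print(step.apply(21))   # 22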

Highlights

  • Industrial organizations are increasingly dependent on the digital components of their business

  • In previous work (Lazovik et al., 2017), we investigated the feasibility of dynamically updating the processing pipeline of a running Apache Spark application

  • From the action variables defined on the transitions in the Constraint Satisfaction Problem (CSP), we can extract the actions assigned to those transitions, which correspond to the user code that should be installed at the variation points (a sketch follows this list)
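
A rough illustration of that extraction step, assuming a solved CSP represented as a plain mapping from transitions to chosen action names; all names here are invented for the example and do not reproduce the paper's implementation:

    # Hypothetical sketch: reading the action variable assigned to each
    # transition of a solved CSP and resolving it to the user code that
    # should be installed at the matching variation point.
    from typing import Callable, Dict

    # Registry of user-supplied operations known at planning time.
    OPERATIONS: Dict[str, Callable] = {
        "parse_csv": lambda line: line.split(","),
        "drop_empty": lambda fields: [f for f in fields if f],
        "sum_fields": lambda fields: sum(float(f) for f in fields),
    }

    def extract_actions(csp_solution: Dict[str, str]) -> Dict[str, Callable]:
        """Map each transition's assigned action name to its operation;
        each transition corresponds to one variation point in the pipeline."""
        return {transition: OPERATIONS[action]
                for transition, action in csp_solution.items()}

    # A solved CSP assigns exactly one action to every transition.
    solution = {"t0": "parse_csv", "t1": "drop_empty", "t2": "sum_fields"}
    new_config = extract_actions(solution)
    # new_config can now be pushed into the running pipeline's
    # variation points (e.g., step.update(new_config["t0"])).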

Summary

INTRODUCTION

Industrial organizations are increasingly dependent on the digital components of their business. We have developed a framework, sparkdynamic (Lazovik et al., 2017), built on top of the popular distributed data processing platform Apache Spark (The Apache Software Foundation, 2015b), to enable the updating of the steps and algorithm parameters of running pipelines without restarting them. The main contribution of this paper is a distributed data processing pipeline reconfiguration framework based on constraint-based AI planning. It ensures that the current industrial user goals are satisfied, takes into account the dependencies between related steps within the pipeline (ensuring its data-type and structural consistency), and automatically incorporates the new configuration.
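
To make the planning idea concrete, the toy sketch below frames pipeline configuration as a small constraint satisfaction problem: each pipeline slot's domain is the set of available operations, and the constraints require that every step's input type match the previous step's output type and that the final step produce the goal type. The operation names and types are invented for illustration, and the brute-force search stands in for the paper's far richer planner.

    # Illustrative only: a toy constraint-based planner that picks a
    # type-consistent chain of operations reaching a goal output type.
    from itertools import product

    # Each available operation declares its input and output data types.
    OPERATIONS = {
        "read_lines": ("file",   "text"),
        "parse_csv":  ("text",   "rows"),
        "clean_rows": ("rows",   "rows"),
        "sum_column": ("rows",   "number"),
        "plot":       ("number", "chart"),
    }

    def plan(source_type: str, goal_type: str, length: int):
        """Enumerate assignments of operations to pipeline slots and return
        the first one satisfying the type-compatibility constraints."""
        for config in product(OPERATIONS, repeat=length):
            prev, ok = source_type, True
            for op in config:
                in_type, out_type = OPERATIONS[op]
                if in_type != prev:       # constraint: step accepts previous output
                    ok = False
                    break
                prev = out_type
            if ok and prev == goal_type:  # constraint: goal type is produced
                return list(config)
        return None

    print(plan("file", "number", length=3))
    # -> ['read_lines', 'parse_csv', 'sum_column']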

Runtime Updating
Updating Distributed Data Processing Pipelines
Spark-Dynamic
Techniques for Building and Checking Pipelines
GENERAL OVERVIEW
PLANNER DESIGN FOR PIPELINE RECONFIGURATION
Core Planning Model
Mapping to the Distributed Pipeline
Planner Representation as CSP
Planning Model Justification
EVALUATION
Plan Generation Time
Dynamic Versus Static
CONCLUSION AND DISCUSSION
DATA AVAILABILITY STATEMENT