Sliding Window Calculations on Streaming Data using the Kepler Scientific Workflow System

Sven Kohler, Supriya Gulati, Gongjing Cao, Quinn Hart, Bertram Ludascher

doi:10.1016/j.procs.2012.04.181

Abstract

In many areas of science, unbounded (potentially infinite) data streams need to be processed in a continuous manner, e.g., to compute running aggregates or sliding window aggregates. One important example is the computation of Growing Degree Days (GDD) from a stream of temperature data, which provides a heuristic tool to predict plant development and the maturity of crops. The process of data acquisition, processing, storage, and presentation forms a scientific workflow, and scientific workflow systems have been developed to automate the execution of such workflows. The whole workflow is decomposed into its individual steps, represented by actors, which in turn are connected by channels that describe the flow of data. This representation allows existing components to be reused across different workflows and, in principle, makes existing workflows easy to modify. In current streaming workflow designs in Kepler, data belonging to a particular time window is typically identified by counting data tokens on channels between actors. However, this token-counting approach works neither for windows of variable length nor for overlapping windows. In this paper, we address these limitations and present a new actor design with two incoming streams: a data stream ordered by timestamp, and a stream of aggregation windows ordered by their start time. We present a new Chunker actor that “stream-joins” the data from one stream with the windows presented on the second stream, where windows represent aggregation intervals that may vary in length and may overlap in time. Windows containing the corresponding data are output as soon as they are completed, i.e., once timestamps in the data stream pass the end time of a window. We illustrate the approach with an improved GDD workflow based on our new Chunker actor.
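The stream-join described above can be illustrated with a minimal Python sketch. This is not the Kepler actor itself: the function name `chunk`, the tuple layouts, and the half-open interval convention `[start, end)` are assumptions made for illustration only. The final flush is likewise an assumption that applies only to finite test input, since an unbounded stream never reaches it.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical layouts, chosen for this sketch only:
#   data item : (timestamp, value), ordered by timestamp
#   window    : (start, end), ordered by start, treated as [start, end)
Completed = Tuple[float, float, List[float]]


def chunk(
    data: Iterable[Tuple[float, float]],
    windows: Iterable[Tuple[float, float]],
) -> Iterator[Completed]:
    """Join a timestamp-ordered data stream with a start-ordered window stream.

    A window is emitted once a data timestamp reaches or passes its end time.
    Windows may differ in length and may overlap.
    """
    window_iter = iter(windows)
    pending = next(window_iter, None)   # next window that has not started yet
    open_windows: List[Completed] = []  # windows currently collecting data

    for t, value in data:
        # Open every window whose start time has been reached.
        while pending is not None and pending[0] <= t:
            open_windows.append((pending[0], pending[1], []))
            pending = next(window_iter, None)

        # Emit windows whose end time has been passed; keep the rest open.
        still_open: List[Completed] = []
        for w in open_windows:
            if w[1] <= t:
                yield w
            else:
                still_open.append(w)
        open_windows = still_open

        # The current value belongs to every window that is still open.
        for w in open_windows:
            w[2].append(value)

    # Assumption: for finite input, flush windows that never saw their end.
    yield from open_windows


if __name__ == "__main__":
    readings = [(0, 10.0), (1, 12.0), (2, 14.0), (3, 16.0), (4, 18.0)]
    overlapping = [(0, 3), (1, 4), (2, 6)]  # variable length, overlapping
    for start, end, values in chunk(readings, overlapping):
        print(start, end, values)
```

Because each window carries its own start and end time, no token counting is needed. A downstream aggregation step (for instance, a GDD computation) would receive each completed window together with exactly the readings that fall inside it.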

Full Text