Defining the execution semantics of stream processing engines

Lorenzo Affetti, Gianpaolo Cugola, Alessandro Margara, Riccardo Tommasini, Emanuele Della Valle

doi:10.1186/s40537-017-0072-9

Open Access

https://doi.org/10.1186/s40537-017-0072-9

Journal: Journal of Big Data
Publication Date: Apr 26, 2017
Citations: 16
License type: open-access

Affiliation: Politecnico di Milano

Abstract

The ability to process large volumes of data on the fly, as soon as they become available, is a fundamental requirement in today’s information systems. Modern distributed stream processing engines (SPEs) address this requirement and provide low-latency and high-throughput data stream processing in cluster platforms, offering high-level programming interfaces that abstract from low-level details such as data distribution and hardware failures. The last decade saw a rapid increase in the number of available SPEs. However, each SPE defines its own processing model and standardized execution semantics have not emerged yet. This paper tackles this problem and analyzes the execution semantics of some widely adopted modern SPEs, namely Flink, Storm, Spark Streaming, Google Dataflow, and Azure Stream Analytics. We specifically target the notions of windowing and time, traditionally considered the key distinguishing factors that characterize the behavior of SPEs. We rely on the SECRET model, introduced in 2010 to analyze the windowing semantics for the SPEs available at that time. We show that SECRET models well some aspects of the behavior of modern SPEs, and we shed light on the evolution of SPEs after the introduction of SECRET by analyzing the elements that SECRET cannot fully capture. In this way, the paper contributes to the research in the area of stream processing by: (1) contrasting and comparing some widely used modern SPEs based on a formal model of their execution semantics; (2) discussing the evolution of SPEs since the introduction of the SECRET model; (3) suggesting promising research directions to direct further modeling efforts.

Highlights

Several modern data-intensive applications need to process large volumes of data on the fly as they are produced
This paper uses SECRET to model five distributed stream processing engines (SPEs), namely Flink, Storm, Spark Streaming, Google Dataflow, and Azure Stream Analytics, that were developed after the introduction of SECRET and are today widely used in companies at the scale of Google, Twitter, and Netflix
The SECRET model we adopt in this paper focuses on the semantics of windows in the case of event time

Summary

Introduction

Several modern data-intensive applications need to process large volumes of data on the fly as they are produced. Stream processing is a central requirement in today’s information systems. This need has driven the development of several stream processing engines (SPEs) that continuously analyze streams of data to produce new results as new elements enter the streams. SPEs receive input streams from one or more sources (grey diamonds in Fig. 1) and organize the computation into a directed graph of operators (white circles and boxes in Fig. 1), either explicitly or implicitly. In the latter case, developers are provided with high-level languages that the SPE automatically translates into the operator graph. Organizing the computation into separate operators enables task parallelism (different operators run on different threads on the same machine, or on different machines), while replication enables data parallelism (different portions of an input stream are processed in parallel on different instances of the same operator).
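As an illustration of the operator-graph idea, the following is a minimal Python sketch and not code from any of the SPEs discussed. It chains two hypothetical operators (parse and sum_by_key) and replicates the second one across several instances, routing each record by a hash of its key. All names, the 'key,value' record format, and the use of a finite list in place of an unbounded stream are assumptions made only for this example.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def parse(line):
    """Operator 1 (hypothetical): turn a raw 'key,value' line into (key, int)."""
    key, value = line.split(",")
    return key, int(value)


def partition(records, n_instances):
    """Route each record to one operator instance by a deterministic key hash.

    Records with the same key always reach the same instance, so each
    instance sees a disjoint portion of the input.
    """
    parts = [[] for _ in range(n_instances)]
    for key, value in records:
        index = zlib.crc32(key.encode("utf-8")) % n_instances
        parts[index].append((key, value))
    return parts


def sum_by_key(records):
    """Operator 2 (hypothetical): aggregate values per key inside one instance."""
    totals = defaultdict(int)
    for key, value in records:
        totals[key] += value
    return dict(totals)


def run_pipeline(lines, n_instances=2):
    """Run parse -> partition -> replicated sum_by_key over a finite input."""
    records = [parse(line) for line in lines]
    parts = partition(records, n_instances)
    # Data parallelism: each replica of sum_by_key handles its own partition.
    with ThreadPoolExecutor(max_workers=n_instances) as pool:
        partials = list(pool.map(sum_by_key, parts))
    merged = {}
    for partial in partials:
        # Key sets are disjoint across partitions, so update never overwrites.
        merged.update(partial)
    return merged
```

Calling run_pipeline(["a,1", "b,2", "a,3"]) yields a total of 4 for key a and 2 for key b, whatever the number of instances. A real SPE would additionally run the operators concurrently on unbounded input and handle distribution and failures, which this sketch deliberately leaves out.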

Objectives

Methods

Discussion

Conclusion