Stream Processing Systems Research Articles

Distributed architectures for efficient processing of streaming data are increasingly critical to modern information processing systems. The goal of this paper is to develop type-based programming abstractions that facilitate correct and efficient deployment of a logical specification of the desired computation on such architectures. In the proposed model, each communication link has an associated type specifying tagged data items along with a dependency relation over tags that captures the logical partial ordering constraints over data items. The semantics of a (distributed) stream processing system is then a function from input data traces to output data traces, where a data trace is an equivalence class of sequences of data items induced by the dependency relation. This data-trace transduction model generalizes both acyclic synchronous data-flow and relational query processors, and can specify computations over data streams with a rich variety of partial ordering and synchronization characteristics. We then describe a set of programming templates for data-trace transductions: abstractions corresponding to common stream processing tasks. Our system automatically maps these high-level programs to a given topology on the distributed implementation platform Apache Storm while preserving the semantics. Our experimental evaluation shows that (1) while automatic parallelization deployed by existing systems may not preserve semantics, particularly when the computation is sensitive to the ordering of data items, our programming abstractions allow a natural specification of the query that contains a mix of ordering constraints while guaranteeing correct deployment, and (2) the throughput of the automatically compiled distributed code is comparable to that of hand-crafted distributed implementations.
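As an illustration of the data-trace idea in the abstract above, the following minimal sketch checks whether two sequences of tagged items belong to the same equivalence class under a dependency relation over tags. It is not the paper's implementation: the function names, the tags `"M"` (measurement) and `"#"` (marker), and the encoding of the dependency relation as a set of tag pairs are all assumptions made for this example. The check uses the standard greedy argument for trace equivalence: the first item of one sequence must be movable to the front of the other by swapping past independent items only.

```python
from typing import Hashable, List, Set, Tuple

Item = Tuple[str, Hashable]   # (tag, value); hypothetical encoding
Dep = Set[Tuple[str, str]]    # dependency relation over tags, treated as symmetric


def dependent(dep: Dep, a: str, b: str) -> bool:
    """Tags a and b are dependent if the pair appears in either order."""
    return (a, b) in dep or (b, a) in dep


def trace_equivalent(u: List[Item], v: List[Item], dep: Dep) -> bool:
    """Return True if v can be obtained from u by repeatedly swapping
    adjacent items whose tags are independent."""
    remaining = list(v)
    for item in u:
        try:
            idx = remaining.index(item)  # first occurrence of item in v
        except ValueError:
            return False                 # item missing: different multisets
        # Every item ahead of it in v must be independent of it,
        # otherwise it cannot be commuted to the front.
        if any(dependent(dep, other[0], item[0]) for other in remaining[:idx]):
            return False
        del remaining[idx]
    return not remaining                 # v must have no leftover items


# Measurements are unordered among themselves, but ordered relative to markers.
dep = {("M", "#"), ("#", "#")}
u = [("M", 1), ("M", 2), ("#", 0), ("M", 3)]
v = [("M", 2), ("M", 1), ("#", 0), ("M", 3)]  # reorders within a block
w = [("M", 1), ("#", 0), ("M", 2), ("M", 3)]  # moves an item across a marker
```

Here `u` and `v` represent the same data trace, while `w` does not, because moving `("M", 2)` past the marker violates the ordering constraint between the two tags.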

Today's stream processing systems handle high-volume data streams efficiently. To achieve this, they are designed to scale out on large clusters of commodity machines. However, despite their efficient use of distributed architectures, they lack support for co-processors such as graphics processing units (GPUs), which are well suited to accelerating data-parallel tasks. The main reason for this lack of integration is that GPU processing and the streaming paradigm follow different processing models: GPUs need a bulk of data present at once, while the streaming paradigm advocates tuple-at-a-time processing. This paper helps fill this gap by proposing Gasser, a system for offloading the execution of sliding-window operators to GPUs. The system focuses on completely general functions by targeting the parallel processing of non-incremental queries that are not supported by the few existing GPU-based streaming prototypes. Furthermore, Gasser provides an auto-tuning approach that automatically finds the optimal values of the configuration parameters (i.e., batch length and degree of parallelism) needed to optimize throughput and latency for a given query and data stream. The experimental evaluation assesses the performance of Gasser by comparing its peak throughput and latency against Apache Flink, a popular and scalable streaming system. We also evaluate the penalty induced by supporting completely general queries relative to the performance achieved by the state-of-the-art solution specifically optimized for incremental queries. Finally, we show the speed and accuracy of Gasser's auto-tuning approach, which self-configures the system by finding the right configuration parameters without manual tuning by users.
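The tension between tuple-at-a-time streaming and bulk GPU processing described above can be sketched on the CPU as follows. This is only an illustrative sketch and not Gasser's actual design: the function `batched_sliding_windows` and its parameters `win_len`, `slide`, and `batch_len` are hypothetical names, windows are assumed to be count-based with `slide <= win_len`, and the median stands in for an arbitrary non-incremental function. Tuples are buffered until a whole batch of windows is available, and the windows of a batch are then evaluated independently, which is the step a GPU could run in parallel.

```python
from statistics import median
from typing import Callable, Iterable, Iterator, List, Sequence


def batched_sliding_windows(
    stream: Iterable[float],
    win_len: int,
    slide: int,
    batch_len: int,
    fn: Callable[[Sequence[float]], float],
) -> Iterator[List[float]]:
    """Yield one list of results per batch of `batch_len` consecutive windows.

    Assumes count-based windows with 1 <= slide <= win_len. Windows of a
    trailing, incomplete batch are not emitted in this sketch.
    """
    assert 1 <= slide <= win_len and batch_len >= 1
    # Number of tuples spanned by batch_len consecutive windows.
    needed = win_len + (batch_len - 1) * slide
    buffer: List[float] = []
    for x in stream:
        buffer.append(x)
        if len(buffer) == needed:
            # Each window is computed from scratch (non-incremental) and
            # independently of the others, so this loop is data-parallel.
            yield [
                fn(buffer[i * slide : i * slide + win_len])
                for i in range(batch_len)
            ]
            # Drop the tuples that no later window needs.
            del buffer[: batch_len * slide]


results = list(
    batched_sliding_windows(range(1, 9), win_len=3, slide=1, batch_len=2, fn=median)
)
```

A larger `batch_len` exposes more parallel work per batch but delays the first result of each batch, which is the kind of throughput and latency trade-off an auto-tuner would have to balance.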

Articles published on Stream Processing Systems

On the performance and convergence of distributed stream processing via approximate fault tolerance

A QoS-Latency Aware Event Stream Processing with Elastic-FaaS

Counting the frequency of time-constrained serial episodes in a streaming sequence

Topology-aware task allocation for online distributed stream processing applications with latency constraints

Online template induction for machine-generated emails

Pec: Proactive Elastic Collaborative Resource Scheduling in Data Stream Processing

An optimal checkpointing model with online OCI adjustment for stream processing applications

Data-Trace Types for Distributed Stream Processing Systems

High availability of data using Automatic Selection Algorithm (ASA) in distributed stream processing systems

Reliable stream data processing for elastic distributed stream processing systems

A Comprehensive Survey on Parallelization and Elasticity in Stream Processing

PASCAL: An architecture for proactive auto-scaling of distributed services

Minimizing cost by reducing scaling operations in distributed stream processing

Towards Low-Latency Batched Stream Processing by Pre-Scheduling

Property-Based Testing for Spark Streaming

Stream Data Load Prediction for Resource Scaling Using Online Support Vector Regression

Performance prediction of data streams on high-performance architecture

Integrating workload balancing and fault tolerance in distributed stream processing system

Raising the Parallel Abstraction Level for Streaming Analytics Applications

GASSER: An Auto-Tunable System for General Sliding-Window Streaming Operators on GPUs
