Apache Storm Research Articles

As one of the most computationally intensive operations in stream processing applications, the join operation can cause severe load imbalance when dealing with skewed data. Most popular solutions rely on monitoring-based dynamic balancing strategies, which makes it difficult to adapt quickly to the changing frequency of the data stream and sometimes causes the balancing strategies that target skewed load in the cluster to fail. To address these issues, we propose to use the predictions of a deep reinforcement learning model to adjust the grouping strategy before the frequency of the data stream changes. This enables the system to adapt quickly to data stream fluctuations while managing resources for effective utilization. This paper makes the following contributions: 1) We explore the main factors that trigger load skew in distributed stream join systems and carefully model the load balancing problem at the application level. 2) We develop a Gated Recurrent Unit sequence-to-sequence model to predict the key frequency distribution of streams, and propose a dynamic grouping algorithm and a feedback-based elastic resource scaling algorithm to solve, in real time, the load imbalance caused by hot keys. 3) We design and implement Aj-Stream, an adaptive stream join system built on Apache Storm from the prediction model and the proposed algorithms. 4) We evaluate system performance through extensive experiments on a large-scale real-world dataset and multiple synthetic datasets. The experimental results demonstrate that Aj-Stream exhibits stable throughput and latency with both static data streams of varying skewness and dynamic data streams. Compared with existing stream join systems, Aj-Stream achieves a 22.1% increase in system throughput and a 45.5% decrease in system latency when dealing with frequently fluctuating data streams.
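
The idea of adjusting the grouping strategy from a predicted key frequency distribution can be illustrated with a minimal Python sketch. It is not the Aj-Stream algorithm: the class `HotKeyGrouper`, the `hot_threshold` and `fanout` parameters, and the round-robin splitting rule are all assumptions made for illustration, and the predicted frequencies are passed in as a plain dictionary rather than produced by a model. A real stream join would also have to make sure that matching tuples from the other stream reach every task that holds part of a split key, which this sketch leaves out.

```python
import zlib
from collections import Counter


def stable_hash(key: str) -> int:
    """Deterministic hash, so routing is reproducible across runs."""
    return zlib.crc32(key.encode("utf-8"))


class HotKeyGrouper:
    """Hypothetical grouper: cold keys go to one task, hot keys are spread."""

    def __init__(self, num_tasks: int, hot_threshold: float, fanout: int = 2):
        self.num_tasks = num_tasks
        self.hot_threshold = hot_threshold      # share of traffic that marks a key as hot
        self.fanout = min(fanout, num_tasks)    # number of tasks a hot key is spread over
        self.hot_keys: set[str] = set()
        self._round_robin: Counter = Counter()

    def update_prediction(self, predicted_freq: dict[str, float]) -> None:
        """Refresh the hot-key set from a predicted frequency distribution."""
        total = sum(predicted_freq.values()) or 1
        self.hot_keys = {
            key for key, freq in predicted_freq.items()
            if freq / total >= self.hot_threshold
        }

    def route(self, key: str) -> int:
        """Return the index of the task that should receive this tuple."""
        base = stable_hash(key) % self.num_tasks
        if key not in self.hot_keys:
            return base
        # Hot key: rotate over `fanout` consecutive tasks starting at `base`.
        offset = self._round_robin[key] % self.fanout
        self._round_robin[key] += 1
        return (base + offset) % self.num_tasks


grouper = HotKeyGrouper(num_tasks=4, hot_threshold=0.5, fanout=2)
grouper.update_prediction({"a": 90, "b": 5, "c": 5})
hot_routes = [grouper.route("a") for _ in range(4)]
cold_routes = [grouper.route("b") for _ in range(4)]
```

In this example key `a` carries 90% of the predicted traffic, so its tuples alternate between two tasks, while `b` stays on a single task.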


Today, large-scale cloud organizations are deploying datacenters and "edge" clusters globally to provide their users with low-latency access to their services. Running stream applications across these geo-distributed sites is emerging as a daily requirement, for example to make business decisions from marketing streams, identify spam campaigns from social network streams, and analyze existing genomes in different labs and countries to track the sources of a potential epidemic. However, while the progress has been encouraging, existing efforts have predominantly centered on stateless stream processing, leaving another urgent trend, stateful stream processing, much less explored. A driving need is that next-generation stream processing systems must store and update states during processing and, most importantly, successfully recover large distributed states when faults and failures happen. Existing studies exhibit major limitations: (1) they mostly inherit MapReduce's "single master/many workers" architecture, where the central master is responsible for all scheduling activities and easily becomes a scalability bottleneck; (2) they offer state recovery mainly through three approaches (replication recovery, checkpointing recovery, and DStream-based lineage recovery), which are either slow, resource-expensive, or unable to handle multiple failures; and (3) they are not adaptive to heterogeneous hardware settings in the cloud. In this paper, we present A-FP4S, a novel adaptive fragment-based parallel state recovery mechanism for stream processing systems that manages and recovers large distributed states for a massive number of stream applications.
The novelty of A-FP4S is that we organize stream operators into a distributed hash table (DHT) based peer-to-peer (P2P) overlay. We then divide each node's local state into many fragments and periodically store them on multiple neighbors of that node (the leaf set nodes of the DHT), ensuring that different sets of available fragments can reconstruct failed states in parallel. As a result, this failure recovery mechanism scales extremely well with the size of the lost state, significantly reduces the failure recovery time, and can tolerate multiple node failures. A-FP4S adapts to heterogeneous hardware settings (e.g., CPU speed, disk/file-system speed, network bandwidth) through automatic parameter tuning over phases. Compared to Apache Storm, A-FP4S achieves a 31.8% to 50.5% reduction in recovery latency. It can scale to many simultaneous failures and successfully recover the state even when more than half of the operators fail or are lost. Large-scale experiments using real-world datasets demonstrate A-FP4S's attractive scalability and adaptivity properties.
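
The fragment-and-neighbor idea described above can be sketched in a few lines of Python. This is a simplified illustration, not the A-FP4S mechanism itself: it uses plain fragment replication on a ring of neighbors instead of a DHT leaf set, and the function names (`split_state`, `place_fragments`, `recover_state`) and the `copies` parameter are hypothetical. Fragments are fetched through a thread pool to mimic parallel reconstruction.

```python
from concurrent.futures import ThreadPoolExecutor


def split_state(state: bytes, num_fragments: int) -> list[bytes]:
    """Cut the state into `num_fragments` contiguous pieces."""
    size = -(-len(state) // num_fragments)  # ceiling division
    return [state[i * size:(i + 1) * size] for i in range(num_fragments)]


def place_fragments(fragments: list[bytes], neighbors: list[str],
                    copies: int) -> dict[str, dict[int, bytes]]:
    """Store each fragment on `copies` consecutive neighbors (ring order)."""
    stores: dict[str, dict[int, bytes]] = {node: {} for node in neighbors}
    for idx, fragment in enumerate(fragments):
        for c in range(copies):
            node = neighbors[(idx + c) % len(neighbors)]
            stores[node][idx] = fragment
    return stores


def recover_state(stores: dict[str, dict[int, bytes]], failed: set[str],
                  num_fragments: int) -> bytes:
    """Rebuild the state from surviving neighbors, fetching fragments in parallel."""
    def fetch(idx: int) -> bytes:
        for node, held in stores.items():
            if node not in failed and idx in held:
                return held[idx]
        raise LookupError(f"fragment {idx} is lost on all surviving neighbors")

    with ThreadPoolExecutor() as pool:
        # pool.map preserves fragment order, so a plain join restores the state.
        return b"".join(pool.map(fetch, range(num_fragments)))


state = b"operator-state-0123456789"
neighbors = ["n1", "n2", "n3", "n4"]
stores = place_fragments(split_state(state, 4), neighbors, copies=2)
recovered_one = recover_state(stores, failed={"n2"}, num_fragments=4)
recovered_two = recover_state(stores, failed={"n1", "n3"}, num_fragments=4)
```

With two copies per fragment, this toy placement survives the loss of any single neighbor and of non-adjacent pairs such as `n1` and `n3`; losing two adjacent neighbors raises `LookupError`, which is the kind of limit a real scheme would address with more copies or coding.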


Articles published on Apache Storm

Stream-aware indexing for distributed inequality join processing

Wearable Smart Cardiac Care: Take Vipasyana for Example

Big Data Analytic Tools Usage among Academic Libraries in Tanzania

Adaptive key partitioning in distributed stream processing

Bibliometric behavior of big data and digital marketing as real-time multimedia

SPinDP: A High-Speed Distributed Processing Platform for Sampling and Filtering Data Streams

An adaptive load balancing strategy for stateful join operator in skewed data stream environments

General-purpose data stream processing on heterogeneous architectures with WindFlow

Adaptive Fragment-Based Parallel State Recovery for Stream Processing Systems

A COMPREHENSIVE STUDY ON BIG DATA FRAMEWORKS

TWITTER SENTIMENTAL ANALYSIS

SDN-enabled Resource Provisioning Framework for Geo-Distributed Streaming Analytics

IRONEDGE: Stream Processing Architecture for Edge Applications

STORMING DATA IN THE CIRCULAR ECONOMY

Development of infrastructure for anomalies detection in big data

Delay-Resistant Geo-Distributed Analytics

JORA: Blockchain-based efficient joint computing offloading and resource allocation for edge video streaming systems

Network-aware worker placement for wide-area streaming analytics

Balanced Schedule on Storm for Performance Enhancement
