There is a rapidly growing need to process large volumes of streaming data in real time in various big data applications. As one of the most widely used systems for streaming data processing, Apache Storm provides a workflow-based mechanism to execute directed acyclic graph (DAG)-structured topologies. With the expansion of cloud infrastructures around the globe and the economic benefits of cloud-based computing and storage services, many such Storm workflows have been migrated, or are in active transition, to clouds. However, modeling the behavior of streaming data processing and improving its performance in clouds remain largely unexplored. We construct rigorous cost models to analyze the throughput dynamics of Storm workflows and formulate a budget-constrained topology mapping problem that aims to maximize Storm workflow throughput in clouds. We show this problem to be NP-complete and design a heuristic solution that considers not only the selection of virtual machine type but also the degree of parallelism of each task (spout/bolt) in the topology. Extensive simulations and real-life workflow experiments deployed in public clouds show that the proposed mapping solution outperforms the default Storm scheduler and other existing methods.