Abstract

Systems enabling the continuous processing of large data streams have recently attracted the attention of the scientific community and industrial stakeholders. Data Stream Processing Systems (DSPSs) are complex and powerful frameworks that ease the development of streaming applications in distributed computing environments like clusters and clouds. Several systems of this kind, like Apache Storm and Spark Streaming, have been released and are currently maintained as open source projects. The scientific community has often used a handful of benchmark applications to test and evaluate new techniques for improving the performance and usability of DSPSs. However, the existing benchmark suites lack representative workloads from the wide set of application domains that can leverage the near real-time performance offered by the stream processing paradigm. The goal of this article is to present a new benchmark suite composed of 15 applications coming from areas like Finance, Telecommunications, Sensor Networks, Social Networks and others. This article describes in detail the nature of these applications and their full workload characterization in terms of selectivity, processing cost, input size and overall memory occupation. In addition, it exemplifies the usefulness of our benchmark suite for comparing real DSPSs by selecting Apache Storm and Spark Streaming for this analysis.

Highlights

  • We are witnessing the exponential growth of data available from different kinds of sources often producing information in the form of streams [1], i.e. unbounded sequences of data items received at variable speed

  • First-generation Data Stream Processing Systems (DSPSs) like Aurora [2], Borealis [3], STREAM [4], and StreamIt [5] originated in the database community and are designed to execute relational algebra queries on data streams rather than on finite and permanent relations

  • The main contribution of this paper is to provide a new benchmark suite of 15 applications coming from different areas and all needing the processing features offered by modern DSPSs

Summary

Introduction

We are witnessing the exponential growth of data available from different kinds of sources (e.g., sensors, financial tickers, social media), which often produce information in the form of streams [1], i.e., unbounded sequences of data items received at variable speed. First-generation Data Stream Processing Systems (DSPSs) like Aurora [2], Borealis [3], STREAM [4], and StreamIt [5] originated in the database community and are designed to execute relational algebra queries on data streams rather than on finite and permanent relations (tables). This experience has opened the domain of streaming analytics, with tools. They target streaming applications in the domain
