Abstract

Real-time data processing is often a necessity, since insights lose value when they are discovered off-line or after the fact. However, large-scale stream processing systems are non-trivial to build and deploy. While many frameworks allow users to create large-scale distributed systems, it remains challenging to understand their performance, their deployment cost, and the impact of potential (partial) outages on real-time performance. Our work considers the performance of Cloud-based stream processing systems in terms of back-pressure and expected utilization. We explore the performance of an exemplar stream application on different Cloud-based virtual machine resources, taking the scale of deployment and its cost into consideration alongside overall performance. To this end, we develop an algorithm based on queueing theory that predicts the throughput and latency of stream data processing while preserving system stability. Our methodology for making the underlying measurements applies to mainstream stream processing frameworks such as Apache Storm and Heron, and is especially suitable for large-scale distributed stream processing where jobs run for extended periods. We benchmark the performance of the system on the national research cloud of Australia (Nectar), and present a performance analysis based on estimating the overall effective utilization.
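
To make the queueing-theoretic framing concrete, the sketch below estimates utilization, latency, and stability for a single stream operator under simple M/M/1 assumptions (Poisson arrivals, exponential service times). This is an illustrative assumption on our part, not the paper's algorithm; the function name and the per-operator model are hypothetical.

```python
# Hypothetical sketch: per-operator utilization and latency under an
# M/M/1 queueing model. The paper's actual algorithm is not reproduced
# here; the model choice and names are illustrative assumptions.

def mm1_estimate(arrival_rate: float, service_rate: float) -> dict:
    """Estimate steady-state metrics for one stream operator.

    arrival_rate -- mean tuple arrival rate (tuples/sec), lambda
    service_rate -- mean tuple service rate (tuples/sec), mu
    """
    utilization = arrival_rate / service_rate  # rho = lambda / mu
    if utilization >= 1.0:
        # Unstable regime: the queue grows without bound, so the
        # system exerts back-pressure (or drops tuples) rather than
        # settling at a finite latency.
        return {"utilization": utilization, "stable": False}
    # Mean sojourn time (queueing + service) for an M/M/1 queue:
    # W = 1 / (mu - lambda).
    latency = 1.0 / (service_rate - arrival_rate)
    # Mean number of tuples in the system via Little's law: L = lambda * W.
    mean_in_system = arrival_rate * latency
    return {
        "utilization": utilization,
        "stable": True,
        "latency_s": latency,
        "mean_in_system": mean_in_system,
    }


if __name__ == "__main__":
    # Example: 800 tuples/s arriving at an operator serving 1000 tuples/s
    # gives rho = 0.8 and a mean latency of 5 ms.
    print(mm1_estimate(arrival_rate=800.0, service_rate=1000.0))
```

The stability check (rho < 1) mirrors the abstract's concern with back-pressure: once arrival rate approaches service rate, predicted latency grows sharply, which is the signal for scaling out the deployment.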
