Abstract

Distributed stream processors are increasingly being deployed in cloud computing infrastructures. In this article, we study the performance characteristics of distributed stream processing applications in Google Compute Engine, which is based on Kubernetes. We identify throughput performance gaps that appear in such environments when a round robin (RR) scheduling algorithm is used. As a solution, we propose a resource-aware stream processing scheduler called resource aware scheduler for stream processing applications in cloud native environments (RaspaCN). We implement RaspaCN's job scheduler as a two-step process. First, we use machine learning to identify the optimal number of worker nodes. Second, we use RR and multiple knapsack algorithms to produce performance-optimal stream processing job schedules. We evaluate the performance benefits of RaspaCN's scheduling algorithm using three application benchmarks that represent real-world stream processing scenarios: HTTP Log Processor, Nexmark, and Email Processor. For fixed input data rates, RaspaCN increased average throughput by at least 37%, 38%, and 10% for the HTTP Log Processor, Nexmark, and Email Processor benchmarks, respectively. Furthermore, we conduct experiments with varying input data rates and show a 7% improvement in average throughput for HTTP Log Processor. These experiments demonstrate the effectiveness of our proposed stream processor job scheduler in improving performance.
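
To make the contrast between the two scheduling ideas named above concrete, the following is a minimal Python sketch. It is not RaspaCN's implementation: the abstract does not say how RR and the multiple knapsack algorithm are combined, so the function names, the single scalar "demand" per task, and the greedy heuristic used here in place of an exact multiple knapsack solver are all assumptions made for illustration. The number of workers is taken as given (in RaspaCN it would come from the machine learning step).

```python
from itertools import cycle


def round_robin(tasks, workers):
    """Assign tasks to workers in cyclic order, ignoring resource demand."""
    schedule = {w: [] for w in workers}
    for task, worker in zip(tasks, cycle(workers)):
        schedule[worker].append(task)
    return schedule


def greedy_multiple_knapsack(demands, capacities):
    """Greedy heuristic for a multiple-knapsack style placement (hypothetical).

    demands:    {task: resource demand}, e.g. CPU units
    capacities: {worker: resource capacity}
    Tasks are placed in decreasing order of demand on the worker with the
    most remaining capacity. Returns (schedule, unplaced_tasks).
    """
    remaining = dict(capacities)
    schedule = {w: [] for w in capacities}
    unplaced = []
    by_demand = sorted(demands.items(), key=lambda kv: kv[1], reverse=True)
    for task, demand in by_demand:
        # If the emptiest worker cannot hold the task, no worker can.
        worker = max(remaining, key=remaining.get)
        if remaining[worker] >= demand:
            schedule[worker].append(task)
            remaining[worker] -= demand
        else:
            unplaced.append(task)
    return schedule, unplaced


if __name__ == "__main__":
    # Hypothetical operators and demands, not taken from the benchmarks.
    demands = {"source": 1, "parse": 3, "filter": 2, "sink": 1}
    capacities = {"w1": 4, "w2": 4}
    print(round_robin(list(demands), list(capacities)))
    print(greedy_multiple_knapsack(demands, capacities))
```

In this toy input, RR places tasks purely by position, whereas the resource-aware placement balances total demand across workers; this is the kind of gap the abstract attributes to RR scheduling.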

