Abstract

Most online service providers deploy their own data stream processing systems in the cloud to conduct large-scale, real-time data analytics. However, such systems, e.g., Apache Heron, often adopt naive scheduling schemes to distribute data streams (in units of tuples) among processing instances, which may result in workload imbalance and system disruption. Hence, there remains a mismatch between the temporal variations of data streams and such inflexible scheduling designs. Moreover, the fundamental limits of the benefits that predictive scheduling brings to data stream processing systems remain unexplored. In this article, we focus on the problem of tuple scheduling with predictive service in Apache Heron. With a careful choice of granularity in system modeling and decision making, we formulate the problem as a stochastic network optimization problem and propose POTUS, an online predictive scheduling scheme that aims to minimize the response time of data stream processing by steering data streams in a distributed fashion. Theoretical analysis and simulation results show that POTUS achieves an ultra-low response time with a stability guarantee. Furthermore, POTUS requires only a modest amount of future information to effectively reduce the response time, even in the presence of mis-prediction.
