Abstract
Stream processing is widely used in big data analytics because it delivers real-time results on continuously arriving data streams with low latency. As data volumes grow and processing logic becomes more complex, the internal state of stream processing applications grows as well. To handle large state efficiently, modern stream processing systems support storing internal state on solid state drives (SSDs) by using persistent key-value (KV) stores optimized for SSDs. For example, Apache Flink and Apache Samza store internal state in RocksDB. However, delegating state management to a persistent KV store degrades performance: the KV store is unaware of stream query semantics, so it cannot tailor its state management strategy to them. In this paper, we investigate the performance limitations of current state management approaches on SSDs and show that query-aware optimizations can significantly improve the performance of stateful query processing on SSDs. Based on these observations, we propose a new stream processing system design with static and runtime query-aware optimizations. We also discuss additional research directions for integrating emerging storage technologies with stateful stream processing.