Abstract
More and more use cases require fast, accurate, and reliable processing of large volumes of data. This calls for a distributed stream processing framework that can spread the load over several machines. In this work, we study and benchmark the scalability of stream processing jobs in four popular frameworks: Flink, Kafka Streams, Spark Streaming, and Structured Streaming. We also determine the factors that influence the performance and efficiency of scaling processing jobs with distinct characteristics. We evaluate both horizontal and vertical scalability. Our results show that scaling efficiency is affected by many factors, including the initial cluster layout and direction of scaling, the pipeline design, the framework design, resource allocation, and data characteristics. Finally, we give recommendations on how practitioners should approach scaling their clusters.
Highlights
Near real-time processing has become increasingly important with the rise of new domains such as IoT
We evaluate the hypothesis that the scalability of a processing job is influenced by many factors that together create the throughput bottleneck
The bottleneck of Kafka Streams is in the maintenance of its RocksDB state backend
Summary
The use cases in these domains often require processing large volumes of data at high velocity, which makes it necessary to distribute processing over several machines. This greatly increases the complexity of an application in many respects, e.g. fault recovery [1], accuracy [2], and state management [3]. To abstract away this complexity, several stream processing frameworks were developed. These frameworks can be used as generic systems to implement a wide range of use cases in a distributed fashion. Horizontal scaling implies an increase in the number of workers, while vertical scaling implies an increase in the resources assigned to each worker.
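The distinction between the two scaling directions can be made concrete with a small sketch. The names below (`ClusterLayout`, `scale_horizontally`, `scale_vertically`, `scaling_efficiency`) are hypothetical and not taken from the paper or from any of the benchmarked frameworks. The efficiency formula (throughput speedup divided by resource growth) is one common way to express scaling efficiency and is assumed here for illustration, and the throughput figures are made-up placeholders rather than measured results.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ClusterLayout:
    """Hypothetical description of a cluster: worker count and CPUs per worker."""
    workers: int
    cpus_per_worker: int

    @property
    def total_cpus(self) -> int:
        return self.workers * self.cpus_per_worker


def scale_horizontally(layout: ClusterLayout, factor: int) -> ClusterLayout:
    """Horizontal scaling: more workers, same resources per worker."""
    return replace(layout, workers=layout.workers * factor)


def scale_vertically(layout: ClusterLayout, factor: int) -> ClusterLayout:
    """Vertical scaling: same number of workers, more resources per worker."""
    return replace(layout, cpus_per_worker=layout.cpus_per_worker * factor)


def scaling_efficiency(before: ClusterLayout, after: ClusterLayout,
                       throughput_before: float, throughput_after: float) -> float:
    """Assumed definition: throughput speedup divided by growth in total CPUs.

    A value of 1.0 means throughput grew in proportion to the added resources.
    """
    speedup = throughput_after / throughput_before
    resource_growth = after.total_cpus / before.total_cpus
    return speedup / resource_growth


# Illustrative layouts only; both scaled layouts end up with the same total CPUs.
base = ClusterLayout(workers=2, cpus_per_worker=4)
wide = scale_horizontally(base, 2)   # 4 workers x 4 CPUs
tall = scale_vertically(base, 2)     # 2 workers x 8 CPUs

# Placeholder throughput numbers (events per second), not measurements.
efficiency = scaling_efficiency(base, wide, throughput_before=100.0,
                                throughput_after=180.0)
```

Both scaled layouts provide the same total CPU count, so any difference in measured efficiency between them would come from factors such as the ones the abstract lists (pipeline design, framework design, resource allocation, data characteristics) rather than from raw capacity.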