Research on Optimal Checkpointing-Interval for Flink Stream Processing Applications

Zhan Zhang, Xian Liu, Wenhao Li, Xiao Qing, Hongwei Liu

doi:10.1007/s11036-020-01729-7

Abstract

Distributed stream processing systems (DSPSs) are widely used to process ever-growing volumes of real-time data. DSPSs are highly susceptible to system failures, so fault tolerance is a major concern that is receiving considerable attention. Flink is a popular stream computing framework that implements a lightweight, asynchronous checkpoint technique based on the barrier mechanism to keep data analysis efficient. In a checkpoint-based fault-tolerance mechanism, a shorter checkpoint interval increases the runtime cost of a streaming application, while a longer one increases the time needed to recover from a failure. Selecting an optimal checkpoint interval is therefore critical to the efficiency of streaming applications. Traditional methods for choosing the checkpoint interval usually assume that the checkpointing delay and the fault recovery time are fixed. However, both factors depend strongly on the intensity of the application’s workload. To obtain a better checkpoint interval under different workload intensities, this paper proposes a performance model that estimates tuple processing latency and a recovery model that estimates fault recovery time. With these two models, an optimal checkpoint interval can be derived. The models and the interval optimisation method are verified experimentally on Flink. The results show that the proposed model can recommend an optimal checkpoint interval according to the system's reliability-related indicators. The proposed approach optimises recovery time and performs efficiently in applications with delay constraints.
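For orientation, the trade-off described above can be illustrated with the classical first-order approximation usually attributed to Young, which is one example of a traditional fixed-cost approach and is not the model proposed in this paper. The sketch below assumes a constant checkpoint cost and a constant mean time between failures (MTBF); the function names and the numeric values are hypothetical and chosen only for illustration.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation: T = sqrt(2 * C * MTBF).

    Assumes a fixed checkpoint cost C and a fixed MTBF, which is exactly
    the simplification that workload-aware models try to relax.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("both inputs must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def expected_overhead(interval_s: float, checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Approximate fraction of time lost for a given interval.

    First term: checkpoint cost paid once per interval.
    Second term: expected rework after a failure (half an interval on average),
    amortised over the MTBF.
    """
    return checkpoint_cost_s / interval_s + interval_s / (2.0 * mtbf_s)


if __name__ == "__main__":
    cost, mtbf = 2.0, 10_000.0  # illustrative values, in seconds
    best = young_interval(cost, mtbf)
    for interval in (best / 2, best, best * 2):
        print(f"interval={interval:.0f}s overhead={expected_overhead(interval, cost, mtbf):.4f}")
```

Under these assumptions, intervals both shorter and longer than the approximated optimum give a higher overhead, which mirrors the runtime-cost versus recovery-time tension. Because the checkpoint cost and recovery time are treated as constants here, the sketch does not capture their dependence on workload intensity.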

Full Text