Minimizing Overheads of Checkpoints in Distributed Stream Processing Systems

Syed Muhammad Abrar Akber, Hai Jin, Yonghui Wang, Hanhua Chen

doi:10.1109/cloudnet.2018.8549548

Abstract

Failures in large-scale systems are inevitable, which makes resilience a key challenge for modern systems. Checkpointing with rollback recovery is a well-known approach to providing fault tolerance in distributed systems. It periodically persists the application state to reliable storage, and the persisted state serves as a recovery point in case of failure. Such periodic checkpoints are not in line with the failure behavior of real systems, since many studies conclude that failures do not occur periodically. The length of the checkpoint interval is a crucial decision because it directly determines the checkpoint overhead. To minimize this overhead, we propose reducing the number of checkpoints taken during application execution by successively increasing the checkpoint interval. The proposed approach considers the failure probability of the underlying infrastructure and tailors checkpoint initiation to it: if the failure probability is low, it iteratively increases the checkpoint interval, which reduces the total number of checkpoints triggered over the application's lifetime and therefore the checkpoint overhead. Experimental results show that the proposed checkpoint policy considerably reduces checkpoint overhead compared with periodic checkpoints.
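
The idea of growing the checkpoint interval while the failure probability stays low can be illustrated with a minimal Python sketch. The abstract does not give the actual update rule, so everything specific here is an assumption made for illustration only: the function names, the multiplicative `growth` factor, the fixed `threshold`, the reset to the base interval when the failure probability is high, and the `failure_prob` callback that stands in for whatever failure estimate the infrastructure provides.

```python
def periodic_schedule(total_time, interval):
    """Return checkpoint times for a fixed (periodic) interval."""
    times = []
    t = interval
    while t <= total_time:
        times.append(t)
        t += interval
    return times


def adaptive_schedule(total_time, base_interval, failure_prob,
                      growth=2.0, threshold=0.1):
    """Return checkpoint times with an interval that grows while failures are unlikely.

    failure_prob is a hypothetical callable mapping a time t to an estimated
    failure probability; growth and threshold are illustrative parameters,
    not values taken from the paper.
    """
    times = []
    interval = base_interval
    t = interval
    while t <= total_time:
        times.append(t)
        if failure_prob(t) < threshold:
            # Low failure probability: lengthen the next interval.
            interval *= growth
        else:
            # High failure probability: fall back to the base interval.
            interval = base_interval
        t += interval
    return times


if __name__ == "__main__":
    periodic = periodic_schedule(100, 10)
    adaptive = adaptive_schedule(100, 10, lambda t: 0.01)
    print(len(periodic), "periodic checkpoints")
    print(len(adaptive), "adaptive checkpoints at", adaptive)
```

Under these assumed parameters, a run of 100 time units with a base interval of 10 takes fewer checkpoints with the growing interval than with the periodic one, which is the effect the abstract describes. When the assumed failure probability stays above the threshold, the sketch degenerates to the periodic schedule.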

Full Text